Abstract We introduce a robust and fully adaptive method for pointwise estimation in heteroscedastic
regression. We allow for noise and design distributions that are unknown and fulfill very weak assumptions 
only. In particular, we do not impose moment conditions on the noise distribution, and we allow for zero 
noise. Moreover, we do not require a strictly positive density for the design distribution. In a first step, 
we fix a bandwidth and construct M-estimators that consist of a contrast and a kernel. We then choose 
the contrast and the kernel that minimize the empirical variance and demonstrate that the corresponding 
M-estimator is adaptive with respect to the noise and design distributions and adaptive (Huber) minimax 
for contamination models. In a second step, we additionally choose a data-driven bandwidth via Lepski's 
method. This leads to an M-estimator that is adaptive with respect to the noise and design distributions
and, additionally, adaptive with respect to the smoothness of an isotropic, locally polynomial target
function. These results are also extended to anisotropic, locally constant target functions. Our data-driven 
approach provides, in particular, a level of robustness that adapts to the noise, contamination, and outliers. 
We finally conclude with a detailed discussion of our assumptions and an outlook on possible extensions. 
1. Introduction

We introduce a new method for pointwise estimation in heteroscedastic regression that is adaptive
with respect to the model. The new method is, in particular, adaptive with respect to the
noise and design distributions (D-adaptive) and with respect to the smoothness of the regression
function (S-adaptive).

Let us briefly review the related literature. The asymptotic normality of M-estimators for the
location parameter in regular models is proved in the pioneering paper [12]. Later, minimax results
in nonparametric regression were derived in the series of papers [26-29]. More recently, a block
median method was used in [7] to prove the asymptotic equivalence between Gaussian regression and
homoscedastic regression for deterministic designs and possibly heavy-tailed noise. Together with
a blockwise Stein's method with wavelets, this leads to an estimator that is adaptive optimal over
Besov spaces with respect to the $L_2$-risk and adaptive optimal over isotropic Hölder classes with
respect to the pointwise risk. This estimator is thus S-adaptive. Additionally, the noise density at
0 is estimated, and a D-adaptive estimator is then found with a plug-in method. However, in contrast
to this paper, only homoscedastic regression is considered, and multivariate regression functions,
in particular anisotropic functions, are not allowed for. We finally mention [24], where a modified
version of Lepski's method is applied in homoscedastic regression.
What is the main idea behind our approach? Consider the estimation of $t^0 \in \mathbb{R}$ in the translation
model $y \sim g(\cdot - t^0)$ for a probability density $g$. The M-estimator $\hat t$ of $t^0$ corresponding to a contrast
$\rho$ and a sample $y_1, \dots, y_n$ of $y$ is then

$$\hat t := \arg\min_{t} \sum_{i=1}^{n} \rho(y_i - t).$$



It holds that (see [12-14])

$$\sqrt{n}\,(\hat t - t^0) \xrightarrow{\mathcal{L}} \mathcal{N}(0, AV), \qquad AV := \frac{\int (\rho')^2 \, dG}{\left(\int \rho'' \, dG\right)^2}, \qquad (1.1)$$

where $G$ is the distribution of $y$, $\rho'$ and $\rho''$ are the first and second derivatives of the contrast $\rho$,
and $\mathcal{L}$ indicates convergence in law. In other words, $\hat t$ is asymptotically normal with asymptotic
variance $AV$. This result suggests that an optimal estimator is obtained by minimizing the asymptotic
variance. This is the main idea behind our approach. To support this idea further, we recall that
(see [12])

$$\inf_{\rho} \frac{\int (\rho')^2 \, dG}{\left(\int \rho'' \, dG\right)^2} = \left(I(G)\right)^{-1}, \qquad (1.2)$$



where $I(G)$ is the Fisher information for the true distribution $G$ and the infimum is taken over all
twice differentiable contrasts. This implies, together with the Cramér-Rao inequality, that an efficient
M-estimator exists. Huber proposed in [12, Proposal 3] to minimize an estimate of the above
asymptotic variance (since the distribution $G$ is not available in practice) over the family of
Huber contrasts (see below). He also conjectured that the corresponding estimator is minimax for
certain contamination models. More recently, an M-estimator with a contrast that minimizes an
estimate of the asymptotic variance was introduced for the parametric model, and its asymptotic
normality was proved (see [1]). As examples, Huber contrasts indexed by their scale and a family
of $\ell_p$ losses are treated. In this paper, we consider local M-estimators consisting of a contrast and a
kernel such that an estimate of the nonasymptotic variance is minimized. We present, in particular,
a nonasymptotic result which shows that the corresponding estimator mimics the oracle, that is,
the function that minimizes the true variance. An advantage of our approach is, for example, that
a data-driven selection of the scale of the Huber contrast provides an adaptive robustness with
respect to outliers. Additionally, a suitable choice of the support of the kernel can take a maximal
number of points around $x_0$ into account (cf. [10]). In particular, noncentered or even nonconvex
supports can be considered. Finally, we show that our estimator is D-adaptive for various sets of
contrasts and kernels with finite entropy.



We finally study the problem of S-adaptation. Our main goal is to find a simultaneously D- and
S-adaptive pointwise estimator for anisotropic target functions. However, this is not straightforward
since the standard Lepski's method (see [19, 21]) only applies to isotropic functions. Therefore, we
restrict ourselves to these functions in a first step. We use Lepski's method for the S-adaptation,
plugging in an estimate of the minimal variance for the D-adaptation (this is also the case in the
context of model selection, see [4], or the Lasso, see [6]). This way, we obtain the first estimator
in heteroscedastic regression with random design and noise distributions with heavy tails which
is simultaneously D- and S-adaptive and optimal in a sense described later. Additionally, we allow
for zero noise (which is detected by the estimator). Furthermore, we note that the application of
Lepski's method to nonlinear estimators is still nonstandard, and only very few examples can be
found in the literature ([8, 23, 24]). In a next step, we extend our results to anisotropic target
functions. For this, we have to restrict ourselves to locally constant target functions. We apply a
modification of Lepski's method given in [16, 20] and construct an optimal, simultaneously S- and
D-adaptive estimator. This is the first application of such a method to nonlinear estimators for
anisotropic target functions. Of great interest is, in particular, the corresponding selection of an
anisotropic bandwidth for applications in the context of image denoising (cf. [2]). Moreover, our
methods can be applied to establish robust, adaptive confidence bands (cf. [11]).

The structure of this paper is as follows: In the following section, we first introduce a nonasymptotic
variance that resembles the asymptotic variance (see Theorem 1) and then provide a choice
for the contrast and the kernel (see Theorem 2). We show, in particular, that the corresponding
estimator is Huber minimax (see Section 2.3). Then, we provide a choice for the bandwidth for
isotropic, locally polynomial target functions (see Theorem 3) and for anisotropic, locally constant
target functions (see Theorem 4). After this, we discuss our assumptions and give an outlook
in Section 4. The proofs are finally conducted in Section 5 and in the Appendix, and some
sample entropy calculations are presented in Section 6.1.



2. A D-adaptive Estimator for Fixed Bandwidths 

In this section, we consider pointwise estimation in heteroscedastic regression for fixed bandwidths.
In the first part, we define an estimator with a local polynomial approach for a fixed kernel
and a fixed contrast. In the second part, we additionally allow for the selection of the kernel and
the contrast via a minimization of the variance of the estimator. Finally, we elaborate on the parametric
model and relate to important classical results.

Let us specify the model beforehand. We assume the observations $Z^{(n)} := (X_i, Y_i)_{i=1,\dots,n}$, $n \in \mathbb{N}^*$,
to be distributed according to $\mathbb{P}$ and to satisfy the set of equations

$$Y_i = f^*(X_i) + \sigma(X_i)\,\xi_i, \qquad i = 1, \dots, n. \qquad (2.1)$$

We aim at estimating the target function $f^* : [0,1]^d \to \mathbb{R}$ at a given point $x_0 \in (0,1)^d$. The target
function is assumed to be smooth; more specifically, it is assumed to belong to a Hölder class (see
Definition 4 below). The target function is obscured by the second part of the above model, the
noise. The noise variables $(\xi_i)_{i \in 1,\dots,n}$ are assumed to be distributed independently according to the
densities $g_i(\cdot)$ with respect to the Lebesgue measure on $\mathbb{R}$. The noise densities may be unknown,
but we assume that $\sum_i g_i(\cdot)$ is symmetric and that there exist $A \in ]0,1]$ and $\gamma_{\min} > 0$ such that

$$\int_{-\gamma_{\min}/\|\sigma\|_\infty}^{\gamma_{\min}/\|\sigma\|_\infty} n^{-1} \sum_{i} g_i(z) \, dz \ge A. \qquad (2.2)$$

The latter assumption is trivially satisfied with $A = 1$ and $\gamma_{\min} = 1$ if $\|\sigma\|_\infty = 0$ (invoking the
convention $1/0 = \infty$). We stress that we do not impose, unlike in the literature on the median
(cf. [7]), any moment assumptions on the noise, and we do not require that the noise densities are
positive at 0. Indeed, Assumption (2.2) imposes that the density $g_i(\cdot)$ has enough mass on the
interval $[-\gamma_{\min}, \gamma_{\min}]$. (We refer to Section 4 for a more detailed discussion on the assumptions.)
The noise level $\sigma : [0,1]^d \to \mathbb{R}_+$ is assumed to be bounded, but may also be unknown. Usually,
the noise level is the variance of the noise; however, this is not the case if the noise distribution
does not have any moments, for example. Finally, the design points $(X_i)_{i \in 1,\dots,n}$ are assumed to be
distributed independently and identically according to $\mu(\cdot)$. For ease of exposition, we also assume
that $(X_i)_{i \in 1,\dots,n}$ and $(\xi_i)_{i \in 1,\dots,n}$ are mutually independent.
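
To make the model concrete, the following Python sketch simulates observations from (2.1) in dimension $d = 1$. It is only an illustration under assumptions that are not part of the text: a uniform design, standard Cauchy noise (symmetric and without moments), and the hypothetical helper name `simulate`. The second call sets the noise level to zero, a case that the model explicitly allows.

```python
import math
import random


def simulate(n, f_star, sigma, seed=0):
    """Draw (X_i, Y_i) with Y_i = f*(X_i) + sigma(X_i) * xi_i for d = 1.

    Assumptions of this sketch (not taken from the paper): the design is
    uniform on [0, 1) and the noise xi_i is standard Cauchy.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()                       # design point
        u = rng.random()
        xi = math.tan(math.pi * (u - 0.5))     # Cauchy draw via inverse CDF
        data.append((x, f_star(x) + sigma(x) * xi))
    return data


# Heteroscedastic, heavy-tailed sample.
sample = simulate(200, f_star=lambda x: math.sin(2 * math.pi * x),
                  sigma=lambda x: 0.1 + 0.5 * x)

# Zero noise level: the observations coincide with the target function.
noiseless = simulate(200, f_star=lambda x: x ** 2, sigma=lambda x: 0.0)
```

Heavy-tailed draws such as these are the situation in which a bounded derivative $\rho'$ matters, since single observations can be arbitrarily far from the target.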






Chichignoud & Lederer 



2.1. Definitions and First Results 

In this part, we introduce an estimator of $f^*(x_0)$ with a local polynomial approach for a fixed
bandwidth, a fixed kernel, and a fixed contrast. The properties of this estimator are highlighted in
Theorem 1.

As a first step, we set the framework for the local polynomial approach (LPA), described for
example in [15] and in [30, Chapter 1]. The key idea of the LPA is to approximate the function in
the neighborhood of the point in question by a polynomial. To start, we consider a hyperrectangular,
not necessarily centered neighborhood $V_h \subseteq [0,1]^d$ of the point in question $x_0 \in (0,1)^d$ such that
$\int_{V_h} dx = \prod_{j=1}^{d} h_j$, where $h_j$ is the $j$th component of a fixed bandwidth $h \in \mathcal{H} := [h_{\min}, h_{\max}]^d \subset
(0,1)^d$. The minimal and maximal bandwidths are given by

hmin := and h max := [ln^)]" 1 /^^, (2.3) 

where $C$ is a constant large enough such that Conditions 1, 2, and 3 in Section 5.1 are satisfied.
Additionally, we define for a fixed $b \in \mathbb{N}$ the set $\mathcal{P} := \{p = (p_1, \dots, p_d) \in \mathbb{N}^d : 0 \le |p| \le b\}$ with
$|p| = p_1 + \dots + p_d$ and denote its cardinality by $|\mathcal{P}|$. For any multi-indexed vector $t = (t_{p_1,\dots,p_d} \in
\mathbb{R} : p \in \mathcal{P}) \in \mathbb{R}^{|\mathcal{P}|}$ and for any $x \in [0,1]^d$, we then define the desired polynomial as




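$$P_t(x) := \sum_{p \in \mathcal{P}} t_p \left(\frac{x - x_0}{h}\right)^{p} \mathbb{1}_{V_h}(x).$$

Here, $\mathbb{1}$ is the indicator function, $z^p := z_1^{p_1} \cdots z_d^{p_d}$ for all $z \in \mathbb{R}^d$, and the division by $h$ is understood
coordinatewise. Finally, for a fixed $M > 0$, we define the set of all polynomials of degree $b$
as $\mathcal{F} := \{P_t : t \in [-M, M]^{|\mathcal{P}|}\}$.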

We now introduce the desired estimator of $f^*(x_0)$. To this end, we first specify what we mean
by a kernel and a contrast. A kernel (function) $K : \mathbb{R}^d \to \mathbb{R}$ is a nonnegative function with a
compact support included in $[-1/2, 1/2]^d$, $\|K\|_\infty \le K_{\max}$ (for a given constant $K_{\max} > 1$), and
$\int K(x)\,dx = 1$. For ease of exposition, we use the notation $K_h(x) := K((x - x_0)/h) / \prod_j h_j$ at
some points. Next, we specify what we mean by a contrast (function):

Definition 1. A function $\rho$ is called contrast (function) if it has the following properties:

1. $\rho : \mathbb{R} \to \mathbb{R}_+$ is a convex and symmetric function and $\rho(0) = 0$;

2. the derivative $\rho'$ of $\rho$ is 1-Lipschitz on $\mathbb{R}$ and bounded: $\|\rho'\|_\infty \le \gamma_{\max}$, for a given constant
$\gamma_{\max} > 1$;

3. the second derivative $\rho''$ of $\rho$ is defined almost everywhere and is $L_{\rho''}$-Lipschitz with respect
to the measure $P$ for some $L_{\rho''} > 0$. Moreover, we assume that $\|\rho''\|_\infty \le 1$ (without loss of
generality) and

$$\rho''_{\min} := \inf_{z \in [-\gamma_{\min}, \gamma_{\min}]} \rho''(z) > 0,$$

where $\gamma_{\min} > 0$ is defined in (2.2).

Note that Assumption 3 implies that contrasts are strictly convex on the interval $[-\gamma_{\min}, \gamma_{\min}]$.
Moreover, for a given $A > 0$, $\gamma_{\min}$ implicitly depends on the noise distribution via Assumption
(2.2), and we assume that it is known; its estimation is discussed in Section 4. Well-known contrasts
are, for any scale $\gamma > 0$, the Huber contrast (see [12])

$$\rho_{H,\gamma}(z) := \begin{cases} z^2/2 & \text{if } |z| \le \gamma, \\ \gamma(|z| - \gamma/2) & \text{otherwise,} \end{cases}$$

and the contrast induced by the arctan function (see [26])

$$\rho_{\mathrm{arc},\gamma}(z) := \gamma z \arctan(z/\gamma) - \frac{\gamma^2}{2} \log(1 + z^2/\gamma^2).$$
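
The two contrasts can be written down directly. The sketch below is an illustration only: the function names are hypothetical, and the arctan-induced contrast is taken in the form $\gamma z \arctan(z/\gamma) - (\gamma^2/2)\log(1 + z^2/\gamma^2)$, whose derivative is $\gamma \arctan(z/\gamma)$. Both derivatives are bounded and 1-Lipschitz, as Definition 1 requires.

```python
import math


def huber(z, gamma):
    """Huber contrast: quadratic near zero, linear in the tails."""
    a = abs(z)
    return 0.5 * z * z if a <= gamma else gamma * (a - 0.5 * gamma)


def huber_d1(z, gamma):
    """First derivative of the Huber contrast: z clipped to [-gamma, gamma]."""
    return max(-gamma, min(gamma, z))


def huber_d2(z, gamma):
    """Second derivative (defined almost everywhere)."""
    return 1.0 if abs(z) <= gamma else 0.0


def arctan_contrast(z, gamma):
    """Contrast induced by the arctan function."""
    return (gamma * z * math.atan(z / gamma)
            - 0.5 * gamma ** 2 * math.log1p((z / gamma) ** 2))


def arctan_d1(z, gamma):
    """First derivative: bounded by gamma * pi / 2."""
    return gamma * math.atan(z / gamma)


def arctan_d2(z, gamma):
    """Second derivative: strictly positive and at most 1."""
    return 1.0 / (1.0 + (z / gamma) ** 2)
```

A small scale makes the Huber contrast behave like the absolute loss, a large scale like the quadratic loss.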

Note that the absolute loss (cf. Assumption 3) and quadratic loss (cf. Assumption 2) do not satisfy
the above conditions. However, they can be mimicked by the Huber contrast with $\gamma$ small (median)
and big (mean). We can now combine a kernel and a contrast to obtain the local $\lambda$-criterion for
any $f \in \mathcal{F}$:

$$P_n \lambda(f) := n^{-1} \sum_{i=1}^{n} \lambda(X_i, Y_i, f), \quad \text{where } \lambda(x, y, f) := \rho(y - f(x))\, K_h(x) \text{ for all } x, y \in \mathbb{R}. \qquad (2.4)$$

The $\lambda$-LPA estimator $\hat f_\lambda(x_0)$ of $f^*(x_0)$ is finally defined as

$$\hat f_\lambda := \arg\min_{f \in \mathcal{F}} P_n \lambda(f). \qquad (2.5)$$

The coefficients of the estimated polynomial can be considered as estimators of the derivatives of
the function $f^*$ at $x_0$. In this paper, however, we focus on the estimation of $f^*(x_0)$.
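
For intuition, the criterion can be minimized numerically in the simplest setting. The sketch below is an illustration under assumptions that go beyond the text: dimension $d = 1$, a locally constant fit (polynomial degree $b = 0$), a box kernel, and the Huber contrast. The names `box_kernel` and `local_m_estimate` are hypothetical. Since the criterion is convex in the constant $t$, its minimizer over $[-M, M]$ is found by bisection on the weighted score $\sum_i K_h(X_i)\,\rho'(Y_i - t)$, which is nonincreasing in $t$.

```python
def box_kernel(u):
    """Box kernel supported on [-1/2, 1/2]; it integrates to 1."""
    return 1.0 if abs(u) <= 0.5 else 0.0


def local_m_estimate(data, x0, h, gamma, M=10.0, kernel=box_kernel, tol=1e-10):
    """Locally constant Huber M-estimate at x0 (d = 1, degree b = 0)."""
    weighted = [(kernel((x - x0) / h) / h, y) for x, y in data]
    weighted = [(w, y) for w, y in weighted if w > 0.0]
    if not weighted:
        raise ValueError("no design point falls into the window around x0")

    def score(t):
        # Weighted sum of clipped residuals; nonincreasing in t.
        return sum(w * max(-gamma, min(gamma, y - t)) for w, y in weighted)

    lo, hi = -M, M
    if score(lo) <= 0.0:
        return lo
    if score(hi) >= 0.0:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Points outside the window are ignored; an outlier inside it has bounded influence.
windowed = [(0.45, 1.0), (0.5, 1.0), (0.55, 1.0), (0.9, 100.0)]
est_window = local_m_estimate(windowed, x0=0.5, h=0.2, gamma=1.0)
contaminated = [(0.5, 0.0)] * 4 + [(0.5, 1000.0)]
est_robust = local_m_estimate(contaminated, x0=0.5, h=0.2, gamma=1.0)
```

In the second example the sample mean would be 200, whereas the clipped score keeps the estimate close to the bulk of the observations.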

The variance of the estimator is crucial for the following. To state it explicitly, we need to
introduce some more notation: First, let $\lambda'$ and $\lambda''$ be the first and second derivatives of the function
$\lambda(x, y, \cdot)$ and set $\Pi_h := \prod_{j=1}^{d} h_j$, $P(\cdot) := \mathbb{E}_{(X,Y) \sim \mathbb{P}} P_n(\cdot)$, and $\lambda'_\infty := \sup_{x,y,f} \Pi_h |\lambda'(x, y, f)| =
\|\rho'\|_\infty \|K\|_\infty$. We then introduce the crucial quantity

$$V(\lambda) := \left( \frac{\sqrt{\Pi_h\, P[\lambda'(f^*)]^2} + \lambda'_\infty (n \Pi_h)^{-1/4}}{P\lambda''(f^*)} \right)^2. \qquad (2.6)$$

We call it nonasymptotic variance since it plays the role of the variance in the risk bounds in the
theorems below. The explicit expressions of the numerator and the denominator can be deduced
from

$$P[\lambda'(f^*)]^2 = \int \mu(x) K_h^2(x) \int [\rho'(\sigma(x) z)]^2 \, n^{-1} \sum_{i} g_i(z) \, dz \, dx \qquad (2.7)$$

and

$$P\lambda''(f^*) = \int \mu(x) K_h(x) \int \rho''(\sigma(x) z) \, n^{-1} \sum_{i} g_i(z) \, dz \, dx. \qquad (2.8)$$

The variance $V(\lambda)$ depends on $h$, but one can show that this dependence is weak. From Assumption
(2.2), the strict convexity of $\rho$ on $[-\gamma_{\min}, \gamma_{\min}]$, and the boundedness assumption on $\rho'$ in Definition
1, we conclude that $V(\lambda) < \infty$. In the particular case $h = (1, \dots, 1)$ (see the parametric case
below), the nonasymptotic variance $V(\lambda)$ tends towards the asymptotic variance $AV(\lambda)$ defined in
(1.1) as $n \to +\infty$.

At this point, we can give a first result for the above estimator. To this end, we define the bias
term of the estimator as

$$b_h := \inf_{f \in \mathcal{F}} \sup_{x \in V_h} |f(x) - f^*(x)|, \qquad (2.9)$$

and we introduce the entropy term for all $\epsilon > 0$ as

B e := 27 jT 1 y/Hr,„(u)Andu + 2 ( (?< ^ )1/4 + ^) + e " ( 2 ' 10 ) 

Here, $H_{\mathcal{F},\nu}(\cdot)$ is the metric entropy of the set $\mathcal{F}$ with respect to the pseudometric

$$\nu(f_1, f_2) := \sqrt{\Pi_h\, P[\lambda'(f_1) - \lambda'(f_2)]^2}, \qquad f_1, f_2 \in \mathcal{F}.$$

Then, the following result holds:

Theorem 1. If $n$ is sufficiently large (according to Condition 1 in Section 5.1), it holds that for
any $\lambda \in \Lambda$, any $h \in \mathcal{H}$, and for all $q > 1$

ErlA^o)-/*^)! 9 

nrV(41n 2 ri) \ 
98'y niax /C max + 4/C max 7 max y 

for a constant $C_q$ ($C_q = 4q|\mathcal{P}| \cdot 68^q\, \Gamma(q)$ works, where $\Gamma(\cdot)$ is the classical Gamma
function).

Remark 1. We note that we could replace in (2.2) the global quantity $\|\sigma\|_\infty$ by the local one
$\sup_{x \in V_h} |\sigma(x)|$. Moreover, if we additionally impose Condition 3 on $n$, the second term on the right
hand side of the above bound is of order $o(1/n)$ and thus negligible. However, we stress that the
above result is nonasymptotic, in contrast to the classical results of Huber (cf. [12] and also [26-29],
[1]). Moreover, we also stress that we do not impose conditions on the design and the noise level
except for its boundedness. In particular, we allow for degenerate designs and vanishing noise.
(A more detailed discussion is given in Section 4.) For the proof, we use Bernstein's inequality
and chaining arguments; in particular, we use deviation inequalities in [22] that rely on Dudley's
entropy integral. With this, we can recover the shape of the variance, but we obtain an additional
(large) factor $C_q B_q$. The reduction of these factors is of minor interest for this paper. Finally, for
further implications of the above result, we refer to Section 2.3.

Remark 2. The above bound is, to the best of our knowledge, a new result in nonparametric
regression. However, the next step is to choose a $\lambda$ that minimizes the right hand side. If we
neglect the second term, this reduces to a minimization of the variance term $B\sqrt{V(\lambda)}/\sqrt{n\Pi_h}$
since the bias term $b_h$ does not depend on $\lambda$. Note that $V$ does not depend on the target function,
and, in particular, not on the smoothness of the target function. This allows for a wide range
of applications in various models, for example, in high dimensional settings (see Section 4). The
adaptation with respect to the smoothness of the target function is finally done via the selection
of a suitable bandwidth. The simultaneous D- and S-adaptation is difficult since the variance $V$
depends on $h$. We detail this in Section 3.

2.2. Selection of the Kernel and the Contrast for Fixed Bandwidths 
(D-adaptation) 

How should the combined function $\lambda \in \Lambda$, that is, the kernel and the contrast, be selected? We
introduce an oracle that minimizes the bound in Theorem 1 above and then propose a selection
that mimics this oracle. We first introduce, for a given set of contrasts $\Upsilon$, a given set of kernels $\mathcal{K}$,
and a bandwidth $h > 0$, the set of possible combined functions $\lambda$:



$$\Lambda := \{\lambda : \lambda(x, y, f) = \rho(y - f(x))\, K_h(x),\ \rho \in \Upsilon,\ K \in \mathcal{K}\}. \qquad (2.11)$$



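We then note that the bias term $b_h$ in Theorem 1 is of importance for the choice of the bandwidth
later. For a fixed bandwidth, however, we can concentrate on the second term only and introduce
the oracle as

$$\lambda^* := \arg\min_{\lambda \in \Lambda} V(\lambda). \qquad (2.12)$$

To mimic the oracle $\lambda^*$, we then define the estimator

$$\hat\lambda := \arg\min_{\lambda \in \Lambda} \hat V(\lambda), \quad \text{where } \hat V(\lambda) := \left( \frac{\sqrt{\Pi_h\, P_n[\lambda'(\hat f_\lambda)]^2} + \lambda'_\infty (n \Pi_h)^{-1/4}}{P_n \lambda''(\hat f_\lambda)} \right)^2. \qquad (2.13)$$

Note that we estimate $P[\lambda'(f^*)]^2$ and $P\lambda''(f^*)$ by their empirical versions $P_n[\lambda'(\hat f_\lambda)]^2$ and
$P_n\lambda''(\hat f_\lambda)$, that we estimate $f^*$ by $\hat f_\lambda$, and that the explicit expressions for the numerator and
the denominator are given by

$$P_n[\lambda'(\hat f_\lambda)]^2 = \frac{1}{n} \sum_{i=1}^{n} K_h^2(X_i) \left[ \rho'(Y_i - \hat f_\lambda(X_i)) \right]^2$$

and

$$P_n \lambda''(\hat f_\lambda) = \frac{1}{n} \sum_{i=1}^{n} K_h(X_i)\, \rho''(Y_i - \hat f_\lambda(X_i)).$$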



We now show that the estimator that results from (2.5) and (2.13) performs, up to constants,
as well as the oracle. For this, we define for all z > 



B z := ^1 V 27^ ^A,y(«)Anduj 



1 



1 



-p H^uA,a,(l) + 10V^- 



2- 



(nn fc )i/* ! 



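where $H_{\mathcal{F} \cup \Lambda, \omega}(\cdot)$ is the metric entropy of $\mathcal{F} \cup \Lambda$ with respect to the pseudometric

$$\omega((f_1, \lambda_1), (f_2, \lambda_2)) := \nu(f_1, f_2) \vee \sqrt{\Pi_h\, P[\kappa(f_1, \lambda_1) - \kappa(f_2, \lambda_2)]^2} \vee \sqrt{\Pi_h\, P[\lambda_1''(f_1) - \lambda_2''(f_2)]^2} \qquad (2.14)$$

for any $f_1, f_2 \in \mathcal{F}$, $\lambda_1, \lambda_2 \in \Lambda$,

$$\kappa(f, \lambda) := \frac{\lambda'(f)}{\sqrt{\Pi_h\, P[\lambda'(f)]^2} + \lambda'_\infty / (n \Pi_h)^{1/4}}, \qquad (2.15)$$

and $\nu(\cdot, \cdot)$ is defined above Theorem 1. For example, we give, in the Appendix, the computation of
this entropy for the family of Huber contrasts indexed by the scale. Then, we have the following
result: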
Theorem 2. If $n$ is sufficiently large (according to Conditions 1 and 2 in Section 5.1), then, for
any $h \in \mathcal{H}$ and for all $q > 1$, it holds that

I +o(l/n). 



E/|/a(*o) - f*(x )\ q < 2C q lb h + So^SS 



Remark 3. We stress that our estimator does not depend on the densities $(g_i)_i$ and $\mu$ and
the noise level $\sigma$, and we observe that it achieves, up to constants, the optimal variance $V(\lambda^*)$.
We thus call $\hat f_{\hat\lambda}$ D-adaptive optimal. This notion of optimality, however, depends on the family $\Lambda$
under consideration.

Remark 4. Via a bias/variance trade-off, we can obtain S-minimax results (with minimal variance)
with respect to the Hölder smoothness $\beta$ of the target function (see Definition 4 below).
Indeed, we can obtain the usual S-minimax rate $n^{-\bar\beta/(2\bar\beta+1)}$, where $\bar\beta$ is the harmonic average of $\beta$.
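
The trade-off mentioned in Remark 4 can be made explicit in the isotropic case. The following worked computation is a sketch under assumptions that are not stated in the text: it uses the bias bound $b_h \le L d\, h^{\beta}$ for an isotropic Hölder function with a common bandwidth $h$, treats the entropy factor as a constant, writes $V$ for the variance, and reads the harmonic average as $1/\bar\beta = \sum_j 1/\beta_j$.

```latex
% Isotropic case: h_1 = ... = h_d = h, so that \Pi_h = h^d.
\begin{align*}
  \text{risk}(h) &\lesssim L d\, h^{\beta} + \sqrt{\frac{V}{n h^{d}}}, \\
  L d\, h_*^{\beta} = \sqrt{\frac{V}{n h_*^{d}}}
    &\iff h_* = \left(\frac{V}{L^2 d^2 n}\right)^{1/(2\beta + d)}, \\
  \text{risk}(h_*) &\lesssim 2 L d \left(\frac{V}{L^2 d^2 n}\right)^{\beta/(2\beta + d)}
    \asymp n^{-\beta/(2\beta + d)} .
\end{align*}
% With 1/\bar\beta = d/\beta, that is \bar\beta = \beta/d, one gets
%   \bar\beta / (2\bar\beta + 1) = \beta / (2\beta + d),
% so the exponent agrees with the rate n^{-\bar\beta/(2\bar\beta+1)}.
```

Under this reading, a smaller variance $V$ leaves the rate unchanged and improves only the constant, which is why the choice of the contrast and the kernel can be separated from the choice of the bandwidth.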



2.3. Parametric Case and Huber Minimaxity 



We finally elaborate on the special case of parametric estimation, that is, we assume $f^* =
t^0$, $t^0 \in [-M, M]$, and consider the model $y \sim g(\cdot - t^0)$ for a symmetric density $g$. In parametric
estimation, we set the kernel equal to 1 and thus consider the estimator

$$\hat t_\rho := \arg\min_{t \in [-M, M]} \frac{1}{n} \sum_{i=1}^{n} \rho(y_i - t) \qquad (2.16)$$

of the scalar $t^0$.



From the above results, we can now deduce the following corollary: 

Corollary 1. Let $\rho^*$ and $\hat\rho$ be constructed according to (2.12) and (2.13) with $h := (1, \dots, 1)$
and $\lambda(y, t) := \rho(y - t)$ for all $y \in \mathbb{R}$ and $t \in [-M, M]$. Then, if $n$ is sufficiently large (according to
Conditions 1 and 2 in Section 5.1), it holds that

E t o \t ? - t°\ < 2^0^-52 + o(l/n). 

We note that the constant $M$ appears only in the residual term and does not play a major
role in the following.

Let us relate our results to the Huber minimaxity. For this, we define the set of $r$-contaminated
normal distributions for a contamination level $r \in [0, 1[$ as

$$\mathcal{G}_r := \{G : G = (1 - r) N + r T,\ T \in \mathcal{S}\},$$

where $N$ is the standard normal distribution and $\mathcal{S}$ is the set of all symmetric real distributions.
The "minimax" variance over this set of distributions is then as follows:
Lemma 1. Let the distribution $G_0$ be the minimizer of the Fisher information $I(G)$ over $\mathcal{G}_r$.
Then, for any $r \in [0, 1[$,

$$\inf_{\rho} \sup_{G \in \mathcal{G}_r} AV(\rho, G) \ge \sup_{G \in \mathcal{G}_r} I^{-1}(G) = I^{-1}(G_0),$$

where the infimum is taken over all twice differentiable and convex contrasts and $AV$ is defined in
(1.1). Moreover, the expression of the density of the distribution $G_0$ is

$$g_0(t) = \frac{1-r}{\sqrt{2\pi}} \begin{cases} \exp(\gamma_r t + \gamma_r^2/2) & \text{if } t < -\gamma_r, \\ \exp(-t^2/2) & \text{if } -\gamma_r \le t \le \gamma_r, \\ \exp(-\gamma_r t + \gamma_r^2/2) & \text{if } t > \gamma_r, \end{cases}$$

where $\gamma_r$ is the solution of

$$(1-r)^{-1} = \frac{2}{\sqrt{2\pi}} \left( \int_0^{\gamma_r} e^{-t^2/2} \, dt + \frac{e^{-\gamma_r^2/2}}{\gamma_r} \right).$$


The first claim follows from (1.2) and the second one from [12, Theorem 2]. Lemma 1 shows that
$I^{-1}(G_0)$ is a lower bound for the asymptotic variance in the worst case. This asymptotic variance
can be achieved, as we see in the following result:

Lemma 2. For any $r \in [0, 1[$ and the Huber contrast $\rho_{H,\gamma_r}$ as defined in the previous section,
it holds that the Huber estimator corresponds to the maximum likelihood estimator for the distribution $G_0$,
that is, $\rho_{H,\gamma_r}(\cdot) = -\ln(g_0(\cdot))$ up to an additive constant, and

$$\sup_{G \in \mathcal{G}_r} AV(\rho_{H,\gamma_r}, G) \le I^{-1}(G_0).$$

This is a corollary of [12, Theorem 2]. It means that the estimator constructed with $\rho_{H,\gamma_r}$ has
minimal asymptotic variance for the worst distribution $G_0$ in $\mathcal{G}_r$. We may say that $I^{-1}(G_0)$ is the
asymptotic minimax variance and the estimator constructed with $\rho_{H,\gamma_r}$ is asymptotic minimax.
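
The scale $\gamma_r$ of the least favorable distribution can be computed numerically. The sketch below is an illustration that rests on one assumption: $\gamma_r$ is characterized by the requirement that the least favorable density integrates to one, which gives $(1-r)^{-1} = 2\varphi(\gamma)/\gamma + 2\Phi(\gamma) - 1$ with $\varphi$ and $\Phi$ the standard normal density and distribution function. The function names are hypothetical. The right hand side is decreasing in $\gamma$, so bisection applies.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normalisation(gamma):
    """Total mass of the Gaussian center plus the two exponential tails,
    in units of (1 - r): 2 * pdf(gamma) / gamma + 2 * cdf(gamma) - 1."""
    return 2.0 * normal_pdf(gamma) / gamma + 2.0 * normal_cdf(gamma) - 1.0


def gamma_for_contamination(r, lo=1e-8, hi=50.0, tol=1e-12):
    """Solve normalisation(gamma) = 1 / (1 - r) by bisection, for 0 < r < 1."""
    if not 0.0 < r < 1.0:
        raise ValueError("the contamination level must lie strictly between 0 and 1")
    target = 1.0 / (1.0 - r)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if normalisation(mid) > target:   # mass too large: gamma is too small
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


gamma_10 = gamma_for_contamination(0.10)
gamma_01 = gamma_for_contamination(0.01)
```

Larger contamination levels lead to smaller scales, that is, to a contrast that is closer to the absolute loss.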

Usually, a minimax estimator is desired for an unknown contamination level $r$. We show that
it can be constructed with Corollary 1: Set $\Upsilon_H := \{\rho_{H,\gamma} : \gamma \in [\gamma_{\min}, \gamma_{\max}]\}$ such that $\gamma_r \in
[\gamma_{\min}, \gamma_{\max}]$ for all $r \in [0, 1[$. Then, define $\hat\gamma$ as the minimizer of $\hat V(\rho_{H,\gamma})$ (see (2.13)) over $[\gamma_{\min}, \gamma_{\max}]$.
Finally, define $\hat t_{\hat\gamma}$ according to (2.16) with $\rho = \rho_{H,\hat\gamma}$. The resulting estimator then has the following
property:



Corollary 2. For any $r \in [0, 1[$, it holds that

l^- n 1 2Gi Br, 

BlipE*. fey-i \<- / J=JL= 
Geg r ^nI{G ) 



The estimator is thus adaptive with respect to the contamination level $r$ and is (up to constants)
asymptotic minimax in the above sense. This corollary is deduced from Corollary 1 and the
definition of $V(\rho^*)$. In the following, we focus on upper bounds that involve the minimal value
of the variance, as in Theorem 2; this is our notion of optimality.
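
The data-driven choice of the Huber scale in the parametric model can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure analyzed above: it minimizes the plain plug-in version of the asymptotic variance (1.1), that is, the empirical mean of $(\rho')^2$ divided by the squared empirical mean of $\rho''$, over a finite grid of scales, and it omits the nonasymptotic correction term. All function names are hypothetical.

```python
def clip(z, gamma):
    """Derivative of the Huber contrast with scale gamma."""
    return max(-gamma, min(gamma, z))


def huber_location(ys, gamma, tol=1e-10):
    """Huber M-estimate of location: root of the clipped-residual score."""
    lo, hi = min(ys), max(ys)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(clip(y - mid, gamma) for y in ys) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def plug_in_variance(ys, gamma):
    """Empirical analogue of AV in (1.1) for the Huber contrast."""
    t = huber_location(ys, gamma)
    residuals = [y - t for y in ys]
    num = sum(clip(r, gamma) ** 2 for r in residuals) / len(residuals)
    den = sum(1.0 for r in residuals if abs(r) <= gamma) / len(residuals)
    return float("inf") if den == 0.0 else num / den ** 2


def select_scale(ys, grid):
    """Scale on the grid with the smallest estimated variance."""
    return min(grid, key=lambda g: plug_in_variance(ys, g))


clean = [-1.0, 0.0, 1.0]
with_outlier = [-1.0, 0.0, 1.0, 50.0]
scale_clean = select_scale(clean, [0.5, 2.0])              # mean-like scale wins
scale_outlier = select_scale(with_outlier, [2.0, 100.0])   # robust scale wins
```

On the clean sample the larger scale has the smaller estimated variance, while a single outlier moves the choice to the smaller scale. This is the adaptive robustness that the selection is meant to provide.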
3. A D-adaptive and S-adaptive Estimator

In this section, we introduce an estimator of $f^*(x_0)$ that is simultaneously S- and D-adaptive.
For this, we apply the data-driven procedure introduced above to select the contrast and the kernel,
and we apply the data-driven Lepski's method to select the bandwidth. Afterwards, we present
adaptive S-minimax results for this D-adaptive estimator.

Let us introduce the necessary definitions first. To start, we recall the notion of S-minimaxity.
To this end, let $\hat f_n(x_0)$ be an estimator of $f^*(x_0)$ and $\mathcal{S}$ a set of functions. For any $q > 0$, we then
define the maximal risk and the S-minimax risk of $\hat f_n$ for $x_0$ and $\mathcal{S}$ as

$$R_{n,q}[\hat f_n, \mathcal{S}] := \sup_{f^* \in \mathcal{S}} \mathbb{E}_{f^*} |\hat f_n(x_0) - f^*(x_0)|^q \quad \text{and} \quad R_{n,q}[\mathcal{S}] := \inf_{\tilde f} R_{n,q}[\tilde f, \mathcal{S}], \qquad (3.1)$$

respectively. The infimum on the right hand side is taken over all estimators. We can now define
the S-minimax rates of convergence and the (asymptotic) S-minimax estimators:

Definition 2. A sequence $\phi_n$ is an S-minimax rate of convergence and the estimator $\hat f$ is an
(asymptotic) S-minimax estimator with respect to the set $\mathcal{S}$ if

$$0 < \liminf_{n \to \infty} \phi_n^{-q} R_{n,q}[\mathcal{S}] \le \limsup_{n \to \infty} \phi_n^{-q} R_{n,q}[\hat f, \mathcal{S}] < \infty.$$

Usually, the set $\mathcal{S}$ is unknown. In our case, for example, it depends on the smoothness $\beta$. More
generally, $\mathcal{S} = \mathcal{S}_m$, $m \in \mathcal{M}$, for a set of parameters $\mathcal{M}$. It is then desirable to have an estimator that
is adaptive with respect to $\mathcal{M}$. This motivates the following definition, where $\Psi := \{\psi_n(m)\}_{m \in \mathcal{M}}$
is a given family of normalizations:

Definition 3. The family $\Psi$ is called admissible if there exists an estimator $\hat f_n$ such that

$$\limsup_{n \to \infty} \sup_{m \in \mathcal{M}} \psi_n^{-q}(m)\, R_{n,q}[\hat f_n, \mathcal{S}_m] < \infty.$$

The estimator $\hat f_n$ is then called $\Psi$-adaptive in the S-minimax sense.

The LPA is designed for functions that can be locally approximated by polynomials. This is,
for example, the case for Hölder classes. Similarly as in [3], we define:

Definition 4. Let $\beta := (\beta_1, \dots, \beta_d) \in ]0, +\infty[^d$ such that $\lfloor\beta_1\rfloor = \dots = \lfloor\beta_d\rfloor =: \lfloor\beta\rfloor$, and let
$L, M > 0$. The function $s : [0,1]^d \to [-M, M]$ belongs to the anisotropic Hölder class $\mathbb{H}_d(\beta, L, M)$
if for all $x, x_0 \in [0,1]^d$

$$|s(x) - P(s)(x - x_0)| \le L \sum_{j=1}^{d} |x_j - x_{0,j}|^{\beta_j} \quad \text{and} \quad \sum_{p} \sup_{x \in [0,1]^d} \left| \frac{\partial^{|p|} s(x)}{\partial x_1^{p_1} \cdots \partial x_d^{p_d}} \right| \le M,$$

where $P(s)(x - x_0)$ is the Taylor polynomial of $s$ of order $\lfloor\beta\rfloor$ at $x_0$, and $x_j$ and $x_{0,j}$ are the $j$th
components of $x$ and $x_0$, respectively.
We distinguish two cases in the following: First, we consider the special case of isotropic Hölder
classes, that is, $\beta_1 = \dots = \beta_d$. These classes require only one common bandwidth for all dimensions
that is chosen with the standard version of Lepski's method (see [19] and [21]). Afterwards, we allow
for anisotropic Hölder classes. These classes necessitate a separate bandwidth for every dimension
of the domain under consideration. The standard version of Lepski's method is not applicable
because it requires a monotone bias. We circumvent this problem using a modified version of
Lepski's method as described in [16] and [20].

3.1. A Fully Adaptive Estimator for Isotropic, Locally Polynomial 
Functions 

We first allow for functions that can be approximated locally by polynomials but restrict ourselves
to isotropic Hölder classes. Therefore, only one bandwidth $h_{iso} = h_1 = \dots = h_d > 0$ has to
be selected. Geometrically, this means that we select a hypercube in $\mathbb{R}^d$ with edge length $h_{iso}$ as
domain of interest (in contrast to the anisotropic case where we select a hyperrectangle with edge
lengths $h_1, \dots, h_d$).

A major issue is the choice of the bandwidth. Unfortunately, we cannot apply Lepski's method
directly since the variance $V(\lambda_{h_{iso}})/(n h_{iso}^d)$ for (cf. (2.4))

$$\lambda_{h_{iso}}(x, y, f) := \lambda(x, y, f) := \rho(y - f(x))\, K_{h_{iso}}(x) \quad \text{for all } x, y \in \mathbb{R}$$

(or an estimate of it as, for example, in (2.13)) is not necessarily monotone with respect to
the bandwidth (see also the next section and Section 4). We can circumvent this problem with a
redefinition of the variance term. For this, we introduce the set of bandwidths $\mathcal{H}^{iso} := [h_{\min}, h_{\max}]$,
where $h_{\min}$ and $h_{\max}$ are defined in (2.3), and we introduce the maximal variance for any $\rho \in \Upsilon$
and $K \in \mathcal{K}$ (see (2.11))

$$V_{\max}(\rho, K) := \sup_{h_{iso} \in \mathcal{H}^{iso}} V(\lambda_{h_{iso}}). \qquad (3.2)$$

The variance $V$ is defined in (2.6), and $\lambda_{h_{iso}} := \lambda$ is defined according to (2.4) with $h := (h_{iso}, \dots, h_{iso})$.
The modified variance term $V_{\max}(\rho, K)$ does not depend on $h_{iso}$. On the one hand, we may lose
considerably by taking the supremum with respect to $h_{iso}$; on the other hand, this allows us to avoid
more restrictive assumptions on the design and the noise. This is detailed in Section 4. We now
define, for any $\rho \in \Upsilon$, $K \in \mathcal{K}$, and $\lambda$, the new oracle as

$$(\rho^*, K^*) := \arg\min_{\rho \in \Upsilon,\, K \in \mathcal{K}} V_{\max}(\rho, K) \qquad (3.3)$$

and the estimator of the variance as

$$\hat V_{\max}(\rho, K) := \sup_{h_{iso} \in \mathcal{H}^{iso}} \hat V(\lambda_{h_{iso}}), \qquad (3.4)$$

where $\hat V(\lambda_{h_{iso}})$ is defined in (2.13). We then select a contrast and a kernel according to

$$(\hat\rho, \hat K) := \arg\min_{\rho \in \Upsilon,\, K \in \mathcal{K}} \hat V_{\max}(\rho, K) \qquad (3.5)$$

and introduce the isotropic M-estimator as

$$\hat f^{iso}_{h_{iso}} := \arg\min_{f \in \mathcal{F}} n^{-1} \sum_{i=1}^{n} \hat\rho(Y_i - f(X_i))\, \hat K_{(h_{iso}, \dots, h_{iso})}(X_i). \qquad (3.6)$$
ft° ■■= argminn- 1 ^^ - f{X i ))K {hi ^... M {X i ). (3.6) 
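Eventually, we set $\mathcal{H}^{iso}_\epsilon := \{h_{iso} \in \mathcal{H}^{iso},\ \exists m \in \mathbb{N} : h_{iso} = h_{\max} \epsilon^m\}$, $\epsilon \in ]0, 1[$, a net on the set of
bandwidths $\mathcal{H}^{iso}$ such that $|\mathcal{H}^{iso}_\epsilon| \le n$, and apply Lepski's method for isotropic functions (see [19]
and [21]) to define the data-driven bandwidth $\hat h_{iso}$:

$$\hat h_{iso} := \max\Big\{ h_{iso} \in \mathcal{H}^{iso}_\epsilon : \big| \hat f^{iso}_{h_{iso}}(x_0) - \hat f^{iso}_{h'_{iso}}(x_0) \big| \le 20\,(B_0 + \mathrm{iso}_\epsilon(n)) \sqrt{\frac{\hat V_{\max}(\hat\rho, \hat K)}{n (h'_{iso})^d}} \ \text{ for all } h'_{iso} \in \mathcal{H}^{iso}_\epsilon \text{ such that } h'_{iso} \le h_{iso} \Big\}, \qquad (3.7)$$

where $\mathrm{iso}_\epsilon(n) := 11 \sqrt{\ln(n |\mathcal{H}^{iso}_\epsilon|)}$.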
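
The selection rule has a simple algorithmic form. The sketch below illustrates the generic Lepski comparison on a geometric grid of bandwidths; it is not the exact procedure with its constants. The estimator and the threshold are passed in as callables, and the names `geometric_grid` and `lepski_select` as well as the toy estimate and threshold at the end are hypothetical.

```python
def geometric_grid(h_max, h_min, eps):
    """Bandwidths h_max * eps**m, m = 0, 1, ..., that are at least h_min."""
    grid = []
    h = h_max
    while h >= h_min:
        grid.append(h)
        h *= eps
    return grid


def lepski_select(grid, estimate, threshold):
    """Largest bandwidth whose estimate agrees with all smaller-bandwidth
    estimates, each comparison using the threshold of the smaller bandwidth."""
    ordered = sorted(grid)
    values = [estimate(h) for h in ordered]
    chosen = ordered[0]          # the smallest bandwidth is always admissible
    for i, h in enumerate(ordered):
        admissible = all(
            abs(values[i] - values[j]) <= threshold(ordered[j])
            for j in range(i + 1)
        )
        if admissible:
            chosen = h
    return chosen


# Toy example: a noiseless "estimate" whose bias grows like h**2, and a
# threshold of the variance shape const / sqrt(h).
grid = geometric_grid(1.0, 0.125, 0.5)
selected = lepski_select(grid,
                         estimate=lambda h: h ** 2,
                         threshold=lambda h: 0.05 / h ** 0.5)
```

The rule stops enlarging the bandwidth as soon as the change in the estimate exceeds what the noise level at the smaller bandwidths can explain. If the threshold is zero, only bandwidths whose estimates coincide with all smaller-bandwidth estimates remain admissible.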

We now obtain on isotropic Hölder classes

$$\mathbb{H}^{iso}_d(\beta, L, M) := \mathbb{H}_d((\beta, \dots, \beta), L, M), \quad \text{for all } \beta, L, M > 0, \qquad (3.8)$$

the following result:

Theorem 3. For $n$ sufficiently large (according to Conditions 1, 2, and 3 in Section 5.1), $x_0 \in
(0,1)^d$, $\beta \in [0, b]$, and $L > 0$, we have



K n , q [ft°(x ),MT(P,L,M)] <C™ inf. J Lrf fef so + (B„ + iso e (n)) ^ V -^^ I +o(l/n), 
/or a constant Q so (C^ so = ^1 V ^) [40? + 2C 9 ] works). 

This result has the flavor of an oracle inequality: the first term on the right hand side is supposed 
to be a bound of the smallest possible pointwise risk, whereas the second term o(l/n) is, at least 
asymptotically, insignificant. The latter is justified by the following corollary: 

Corollary 3. Under the conditions of the previous theorem and if $V_{\max}(\rho^*, K^*) > 0$, we have

limsup sup — . , : A= -= f K n , q [ft°(x ), EPJ°(A L, M)] < oo. 

n^oo /3>o,l>o \ (B + iso e (n))y / Y max {p*,K*) J 

This corollary can be deduced by minimizing the first term on the right hand side of the last theorem
via the usual bias/variance trade-off.



Remark 5. This corollary shows that our estimator is simultaneously S- and D-adaptive. We note
that this result generalizes results in [7] (that rely on the asymptotic equivalence of the block median
method) to heteroscedastic regression with random design. We also stress that our estimator does
not require positive noise densities at their median and thus allows for more general noise densities.
Additionally, the choice of the contrast is Huber minimax (see Corollary 2 and [12]). We also note
that Lepski's method has been used for local M-estimators in [24], but not for locally polynomial
M-estimators as is done here. We can finally deduce the rate $(\ln(n)/n)^{\beta/(2\beta+1)}$ in the above
result from the entropy calculations in Section 6.1. This rate is asymptotically nearly optimal (see
[5] and [19]); the additional factor $\ln(n)$ is the usual price to pay in pointwise adaptive estimation.
This is discussed in more detail in Section 4.
Remark 6. Note that our estimator detects whether noise is present. Indeed, the maximal
variance (3.2) vanishes when the noise level is zero. The threshold term in Lepski's procedure
(3.7) then also vanishes, and the procedure selects a small bandwidth (possibly the smallest one).
Our estimator thus has a small bias and no variance, that is, it is only a simple approximation of the
target.



3.2. A Fully Adaptive Estimator for Anisotropic, Locally Constant 
Functions 

In this part, we allow for anisotropic Hölder classes and for (possibly) separate bandwidths for
each dimension. In return (see Section 4), we restrict ourselves to locally constant functions, that
is, $b = 0$ (and thus $|\mathcal{P}| = 1$) and $\mathcal{F} = [-M, M]$, and we restrict ourselves to the uniform design
$\mu(\cdot) = 1$ with a homoscedastic noise level $\sigma(\cdot) = \sigma > 0$. We introduce an S- and D-adaptive estimator
of $f^*(x_0)$ in this setting and give its main properties in Theorem 4. The results are, in particular,
applicable to linear estimators or, more generally, to M-estimators with twice differentiable
contrasts.



We first introduce an estimator for each $h \in \mathcal{H}$. For this, we define the variance

,„ ^ (y/S [p'M^ELiftW^ + llp'lloollA'llco^j-^X 2 

which is independent of the bandwidth $h$. As above, we then introduce the oracle for a set of
contrasts $T$ and a set of kernels $\mathcal{K}$ as

$(\rho^*, K^*) := \arg\min_{\rho \in T,\, K \in \mathcal{K}} V(\rho, K)$. (3.10)

Next, we introduce an estimator of the variance as

$\hat V(\rho, K) := \hat V(\lambda_{h_{\max}})$, (3.11)

where $\hat V(\lambda)$ is defined in (2.13) and $\lambda_{h_{\max}}(x, y, f) := \rho(y - f(x)) K_{h_{\max}}(x)$. The data-driven selection
of the contrast and the kernel is finally

$(\hat\rho, \hat K) := \arg\min_{\rho \in T,\, K \in \mathcal{K}} \hat V(\rho, K)$, (3.12)

and, similarly to (2.4) and (2.5), the estimator is

$\hat f^{h} := \arg\min_{f \in \mathcal{F}} n^{-1} \sum_i \hat\rho(Y_i - f(X_i)) \hat K_h(X_i)$ (3.13)

for all $h \in (0, 1)^d$. It is again necessary that $(\hat\rho, \hat K)$ does not depend on the bandwidth $h$; we discuss
this point in Section 4.
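
For intuition, a locally constant M-estimator of the form (3.13) can be sketched numerically as below. This is a minimal sketch under simplifying assumptions that are ours and not the paper's: the contrast is fixed to a Huber contrast with a given scale `gamma` (instead of the data-driven choice), the kernel is a product indicator kernel, and the convex criterion is minimized over $[-M, M]$ by ternary search. All function names are hypothetical.

```python
import math

def huber(u, gamma):
    """Huber contrast with scale gamma: quadratic near 0, linear in the tails."""
    a = abs(u)
    return 0.5 * u * u if a <= gamma else gamma * (a - 0.5 * gamma)

def box_kernel(x, x0, h):
    """Product indicator kernel on the window x0 + [-h_j/2, h_j/2] in each coordinate."""
    inside = all(abs(xj - cj) <= hj / 2 for xj, cj, hj in zip(x, x0, h))
    return 1.0 / math.prod(h) if inside else 0.0

def local_constant_m_estimate(xs, ys, x0, h, gamma, M, iters=200):
    """Minimize t -> sum_i huber(y_i - t) * K_h(x_i - x0) over [-M, M]."""
    weights = [box_kernel(x, x0, h) for x in xs]

    def criterion(t):
        return sum(w * huber(y - t, gamma) for w, y in zip(weights, ys) if w > 0)

    lo, hi = -M, M
    for _ in range(iters):  # ternary search is valid because the criterion is convex in t
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if criterion(m1) <= criterion(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```

Observations outside the window receive weight zero, and a single large outlier inside the window only enters through the linear part of the contrast, which bounds its influence on the estimate.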



Eventually, we can describe the choice of the bandwidth $h$ with Lepski's method. For this,
we define for all $a, b \in \mathbb{R}$ the scalar $a \vee b := \max(a, b)$ and for all $(h, h') \in (0, 1)^d \times (0, 1)^d$ the
vector $h \vee h' := (h_1 \vee h'_1, \ldots, h_d \vee h'_d)$. We then consider the two families of Locally Constant
Approximation (LCA) estimators

$\{\hat f^{h}\}_{h \in (0,1)^d}$ and $\{\hat f^{h,h'} := \hat f^{h \vee h'}\}_{(h,h') \in (0,1)^d \times (0,1)^d}$,

where $\hat f^{h}$ is defined in (3.13). Note that $\hat f^{h,h'} = \hat f^{h',h}$ by symmetry. Recall the definition of the
set of bandwidths $\mathcal{H} := [h_{\min}, h_{\max}]^d$, where $h_{\min}$ and $h_{\max}$ are defined in (2.3). Additionally, we
introduce an order $\prec$ on $\mathcal{H}$ such that



d 



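In particular, the variance is decreasing with respect to this order. We finally introduce a net
$\mathcal{H}_\varepsilon := \{h_{\min}\} \cup \{h \in \mathcal{H} : \forall j = 1, \ldots, d,\ \exists m_j \in \mathbb{N} : h_j = h_{\max}\varepsilon^{m_j}\}$, $\varepsilon \in\, ]0, 1[$ (where we assume that $|\mathcal{H}_\varepsilon| \le n$),
set $\mathrm{ani}_\varepsilon(n) := 11\sqrt{\ln(n\,|\mathcal{H}_\varepsilon|)}$, and select the bandwidth according to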



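$\hat h := \max\Bigl\{ h \in \mathcal{H}_\varepsilon : \bigl|\hat f^{h,h'}(x_0) - \hat f^{h'}(x_0)\bigr| \le 16\,(B_0 + \mathrm{ani}_\varepsilon(n))\,\frac{\sqrt{\hat V(\hat\rho, \hat K)}}{\sqrt{n\Pi_{h'}}} \ \text{for all } h' \in \mathcal{H}_\varepsilon \text{ such that } h' \preceq h \Bigr\}$, (3.14)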



where the maximum is taken with respect to the order V max (-, •) and (p, K) are defined in (3.4) 
and (3.12), and B z is defined in (2.14). 
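
The two combinatorial ingredients of this selection rule, the geometric net of bandwidth vectors and the componentwise maximum $h \vee h'$, can be sketched as follows. The function names are hypothetical, and we assume for illustration that the exponents $m_j$ range over $0, 1, 2, \ldots$ and that the corner point is $(h_{\min}, \ldots, h_{\min})$.

```python
import itertools

def geometric_net(h_min, h_max, eps, d):
    """All d-vectors with coordinates h_max * eps**m (m = 0, 1, ...) that stay >= h_min,
    together with the corner point (h_min, ..., h_min), in lexicographic order."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    levels = []
    h = h_max
    while h >= h_min:
        levels.append(h)
        h *= eps
    net = set(itertools.product(levels, repeat=d))
    net.add((h_min,) * d)
    return sorted(net)

def join(h, h_prime):
    """Componentwise maximum h v h' of two bandwidth vectors."""
    return tuple(max(a, b) for a, b in zip(h, h_prime))
```

The net grows like the $d$-th power of the number of levels per coordinate, so in practice `eps` has to be chosen such that the size of the net does not exceed $n$, as assumed above.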



We can now give the following result for the estimator $\hat f^{\hat h}$:

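Theorem 4. If $n$ is sufficiently large (according to Conditions 1, 2, and 3 in Section 5.1),
$x_0 \in (0, 1)^d$, $\beta \in (0, 1]^d$, and $L > 0$, then it holds that

$R_{n,q}\bigl[\hat f^{\hat h}(x_0), \mathcal{H}_d(\beta, L, M)\bigr] \le C_q \inf_{h \in \mathcal{H}} \Bigl\{ L \sum_{j=1}^{d} h_j^{\beta_j} + (B_0 + \mathrm{ani}_\varepsilon(n)) \frac{\sqrt{V(\rho^*, K^*)}}{\sqrt{n\Pi_h}} \Bigr\}^q + o(1/n),$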
/or a constant C q (C q — |4| g [5gGamma(q) 1152 9 ] works). 

We can also derive the following corollary from Theorem 4 via a bias/variance trade-off: 



Corollary 4. Let $\bar\beta := (\sum_j 1/\beta_j)^{-1}$ be the harmonic average. Under the conditions of the previous
theorem and if $V(\rho^*, K^*) > 0$, it holds that

, v 0/(20+1) 

limsup sup — , - " . = , = I Tl n [f h (x ), U d 0, L, M)] < oo. 

n ^°° /3e(o,i] d ,L>o \{Bo + am e {n))^V{p*,K*) J 

This corollary can be deduced by minimizing the first term on the right-hand side of the last theorem
via the usual bias/variance trade-off.

Remark 7. This transfers the results of the previous section to anisotropic Hölder classes. However,
as opposed to the previous results, the above corollary only allows for locally constant functions.
Moreover, we note that this is, to the best of our knowledge, the first application of the method of [20]
to select an anisotropic bandwidth for nonlinear M-estimators. We discuss this in Section 4.
Finally, we refer to the remarks after Theorem 3. The adaptive S-minimax rate $(\ln(n)/n)^{\bar\beta/(2\bar\beta+1)}$
follows from the definition of $\mathrm{ani}_\varepsilon(n)$ and is nearly optimal. The optimal rate is given by [17] in the white noise
model for anisotropic Hölder functions.



4. Discussion 

Let us discuss our assumptions and restrictions in more detail and highlight some open problems:

1. The symmetry assumption on our model (2.1) (cf. [12], [25]) leads to $E_{f^*}(\sum_i \rho'(\xi_i)) = 0$. We
stress that we only assume that the sum $\sum_i g_i(\cdot)$ is symmetric. This is satisfied, of course,
if all densities $g_i(\cdot)$ are symmetric, but this need not be the case. The symmetry assumption
can be replaced in the proof of Proposition 1 (control of the deviations of M-estimators) if 
the expectation stays very small, that is, E/* (//(£)) < n . To ensure small expectations for 
asymmetric sums of densities, we expect that an asymmetric contrast has to be chosen. This 
seems to be an interesting but hard problem. 

2. It is well known that the median is very sensitive to the noise density at 0. Indeed, its
variance is $1/(4g^2(0))$. The value of $g(0)$ is estimated in [7], for example, but in practice, this
requires many observations near the location. In contrast, contrasts as in Definition 1
(the Huber contrast with a scale $\gamma$, for example) depend on the mass of the noise density
on the interval $[-\gamma, \gamma]$ (denominator of the variance (2.6)). Moreover, note that the term in
assumption (2.2) is, up to p^ in , a lower bound of the denominator of the variance in (2.6). 
Therefore, the parameter $\gamma_{\min}$ can be estimated for a given $A$ similarly as the denominator
of the variance. The mentioned assumption guarantees the consistency of M-estimators with
a contrast that is strictly convex on the interval $[-\gamma_{\min}, \gamma_{\min}]$. Additionally, if the parameter $\gamma_{\min}$ is
chosen as a function of $A$ such that a sufficiently large mass lies on the appropriate
interval, it guarantees a good estimation of the variance for all $\gamma > \gamma_{\min}$. However, we note
that $A$ is expected to require a calibration in practice.

3. Conditions 2 and 3 on n in the following section are only introduced to simplify the residual 
terms in the proofs. However, the first assumption in Condition 1 is crucial. We recall that 
6/ lmax is the bias and Ap'^ iu is a lower bound of the denominator of the variance, that is, the 
mass of the noise density on $[-\gamma_{\min}, \gamma_{\min}]$. Condition 1 thus means that this mass has to be
larger than the bias. This ensures that the denominator of the variance is not too small and 
thus that the estimator is consistent (cf. Lemma 7). 

4. To estimate the variance of M-estimators (2.6), we use its empirical version, but the residuals
remain unknown. To solve this problem, we note that $Y_i - \hat f_\lambda$ is an estimate of $\sigma(X_i)\xi_i$ if and
only if $\hat f_\lambda$ is a consistent estimator of $f^*$. Assumption (2.2) is imposed to guarantee the
consistency of all estimators in $\Lambda$. However, a pre-estimator could be used (for example,
the contrast associated with the arctan function as defined below Definition 1), and thus a more
general family of estimators could be considered (some of them possibly inconsistent).

5. We do not assume any conditions on the design and the noise level except for the boundedness
of the noise level. The design density and the noise level could be zero or explode at $x_0$. This
can be detected via the variance (2.6) if the rate of convergence is influenced (see [9] for
degenerate design). However, the design and the noise level could compensate each other
such that no effect is visible in the variance term. This is a very interesting point and could be
studied in the future.

6. Lepski's method is very sensitive to outliers (see [24]). In this paper, however, we chose 
the robustness via the minimization of the variance. This could be interesting for many 
applications. 

7. In Section 3.2, we present anisotropic results for pointwise estimation in heteroscedastic
regression with heavy-tailed noise and random designs. To the best of our knowledge, this is
the first result of this kind for nonlinear M-estimators in our framework. We note, however,
that we have to restrict ourselves to locally constant M-estimators because of the bias term
(cf. Lemma 12).

8. We allow in this paper for a selection of the contrast and the kernel from large families.
Additionally, we can extend the families by allowing for a selection of the support of
the kernel (not necessarily centered at $x_0$). This could be interesting (cf. [2, 10]), especially
for applications. Furthermore, we could add, for example, the selection of the tail of the
contrast. Such extensions are only limited by the required convexity of the contrast and
the complexity of the selection. Indeed, we need the contrast to be convex, and strictly convex
around 0, to ensure that the denominator of the variance (2.6) is positive. We think that this
has to be studied further.

9. We obtain the desired variance in Theorems 1 and 2 up to the constants Bq and Bq, respec- 
tively (cf. Remark 1). These constants are mostly due to Dudley's integral that is a part of 
the deviation inequalities from [22] we use. We expect that these constants can be reduced 
with a refined analysis. 

10. The variance and the choice of the contrast and the kernel do not depend on the bias term
(see Theorem 1, (2.13), and Remark 2) and, more generally, do not depend on what we
estimate. This is an interesting point because it allows for a treatment of other problems,
such as high-dimensional settings. In [18], for example, the Huber loss with an $\ell_1$ penalization
is studied. The authors show that the shape of the tuning parameter is similar to the variance of
M-estimators (cf. (1.1)). We thus expect that our results on the choice of the contrast can
be applied in high-dimensional settings.

11. The simultaneous D- and S-adaptation is a hard problem, especially because the variance
(2.6) depends on the bandwidth, which is the parameter of interest in S-adaptation (see
Section 3). Lepski's method requires a decreasing variance with respect to the bandwidth,
but unfortunately, this is not always the case in heteroscedastic regression. For example, the
noise level could be zero in a neighborhood $V_h$ of $x_0$ and huge on the set $V_{h'} \setminus V_h$, where $V_{h'}$
is a larger neighborhood of $x_0$. This would imply that the variance increases. To avoid such
problems, we propose to maximize the variance with respect to $h$ (see Section 3), but this
is a very conservative approach. Models with a homoscedastic noise and a uniform design
do not have these issues (cf. Section 3.2). It may also happen that the design and the noise
level are such that the variance is decreasing and thus Lepski's method is applicable without
problems.

12. From the computation of the entropy (in Section 6.1) and the definition of $\mathrm{iso}_\varepsilon(n)$, the shape
of $20(B_0 + \mathrm{iso}_\varepsilon(n))$ in the threshold term in (3.7) is $C\ln(n)$, where $C$ is a positive, known,
but large constant. An appropriate value for applications is rather between 1 and 2 (see [21]).
Usually, such quantities are calibrated with cross-validation or similar methods. Moreover,
we showed in Corollary 3 that our estimator achieves the minimax rate up to a factor $\ln(n)$.
As mentioned in Remark 5, this is due to the threshold term in the selection rule (3.7) and
is nearly optimal. Indeed, the optimal factor is $(b - \beta)\ln(n)$ in a certain sense (see [17]). To
achieve this optimality, the term $\mathrm{iso}_\varepsilon(n)$ in (3.7) has to be proportional to $\ln(h_{\max}/h_{\mathrm{iso}})$
(see [21]). The same remark applies to the anisotropic rate in Corollary 4 (see [17]). The
optimality of these rates is only proved in the white noise model (see [5, 17, 19]), but we
conjecture that they are nearly S-minimax optimal in more general settings (for all models
where the Fisher information exists, for example).
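
The comparison in item 2 between the median and a Huber contrast can be checked numerically in the simplest setting. The sketch below is an illustration under assumptions of our own: a pure location model with a single noise density and no kernel or design, for which the classical asymptotic variance of an M-estimator with score $\psi$ is $E[\psi(\xi)^2]/(E[\psi'(\xi)])^2$; for the Huber contrast the denominator is the squared mass of the noise density on $[-\gamma, \gamma]$. The function names are hypothetical, and the integrals are approximated with a midpoint rule.

```python
import math

def normal_pdf(z):
    """Standard normal density, used here only as an example noise density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def huber_asymptotic_variance(density, gamma, lim=12.0, steps=100_000):
    """E[psi(xi)^2] / P(|xi| <= gamma)^2 for psi(u) = clip(u, -gamma, gamma)."""
    dz = 2.0 * lim / steps
    second_moment = 0.0
    mass = 0.0
    for k in range(steps):
        z = -lim + (k + 0.5) * dz  # midpoint of the k-th cell
        g = density(z)
        psi = max(-gamma, min(gamma, z))
        second_moment += psi * psi * g * dz
        if abs(z) <= gamma:
            mass += g * dz  # mass of the noise density on [-gamma, gamma]
    return second_moment / (mass * mass)

def median_asymptotic_variance(density):
    """Asymptotic variance 1 / (4 g(0)^2) of the sample median."""
    return 1.0 / (4.0 * density(0.0) ** 2)
```

For standard normal noise, the median value is $\pi/2$, whereas the Huber value approaches 1 (the variance of the mean) as the scale grows; the denominator of the Huber value depends on the density only through its mass on $[-\gamma, \gamma]$, not through its value at a single point.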

5. Proofs of the Main Results 

Let us introduce some additional notation to simplify the exposition. First, we introduce the 
best approximation of the target /* in T: 




(5.1) 
The minimum is not necessarily unique, but all minimizers work for our derivations. We then set
$t^0 := t^0(f^*, x_0, h) := \{t^0_p : p \in \mathcal{P}\}$ and $f^0 := P_{t^0}$. Next, we denote the vector of the monomials
$(x - x_0)^p/h^p$ of order smaller than or equal to $b$ by $X$ and the smallest eigenvalue of the matrix
$\int X^\top X \mu(x) K_h(x)\,dx$ by $\nu$. This allows us to define the set

T K :={/ = Pt : ||* — < 5„} , (5.2) 

where 

S n := 2\V\ 3/2 (AC il ^y 1 (Vn)" 1 +b h J K h (x)n(x)dx) . 
Furthermore, we denote the vector of partial derivatives of the A-criterion P„A(-) (defined in (2.4) 

by 

(-ET P »H-)) (5-3) 



v ®tp / per 

and the corresponding expectation and the "parametric" expectation with respect to the distribu- 
tion P° of (X, f°(X) + <r(X)€) by 

p[Dx(-)] and P°[D X (-)}, (5.4) 
respectively. Next, we introduce the Jacobian matrix Jd of P° [D\] as 



= ( A P o 

at 



(55) 



where is the pth component of D\(-). The Jacobian matrix exists according to Defini- 

tion 1 and Fubini's Theorem. Furthermore, the sup-norm on R^l is denoted by || • H^, and 
the vector of coefficients of the estimated polynomial $\hat f_\lambda$ is denoted by $\hat t_\lambda$. Finally, we set
$\tilde a_n := \max\{\sqrt{b_{h_{\max}}}, (\ln n)^{-1}\}$ and

$a_n := \frac{(1 + \tilde a_n)\sqrt{1 + \tilde a_n}}{(1 - \tilde a_n)\sqrt{1 - \tilde a_n}}$, (5.6)

and 

$c_\lambda := P\lambda''(f^*)$. (5.7)
By Definition 1, it holds that $\inf_{\lambda \in \Lambda} c_\lambda > 0$.

5.1. Conditions on n 

Condition 1: We assume that $n$ is sufficiently large such that for all $\lambda \in \Lambda$



y/bh^ + S„ < - A -^f and 7max AC max flo < ^/n< in (2 Inn)" 1 . 
Condition 2: We assume that $n$ is sufficiently large such that for all $\lambda \in \Lambda$ and all $h \in \mathcal{H}$

(2V V )7max/C max (^+& /t )+ 7 "^^A(») < ^ max f ^ f (x)/i(x)(ia;, II/, inf P [A'(/*)] 2 1 

and a„ < 1/3, where a n is defined in (5.6). 
Condition 3: We assume that $n$ is sufficiently large such that

2 



2]n(n)< n< in /(41n 2 n) lly^Mg 2 7max K max 



5.2. An Auxiliary Result

Proposition 1. Let A be a set of functions as in (2.4) such that Hj?uA,u < oo, and let n be 
sufficiently large (according to Condition 1 above). Then, for any z > and any h G %, 



sup 

AGA 



\f\{xo) - f {xo)\ - 2- 



> 3b h S n 

J AeA 



f) {AeJj,} <2|P|exp(-z), 
\eA / 



where B z is defined in (2.14). 



Recall that A depends on the bandwidth h, which is fixed here. We also note that the constants 2 
and 3 can be replaced by o(l). Finally, if only one fixed function A £ A is considered, the expressions 
simplify considerably as we show in the following lemma: 

Lemma 3. Let A G A be fixed, Hj^ tV < oo, and let n be sufficiently large (according to Condition 
1 above). Then, for any e > and any h £14, 

P/- M/aOco) - TM > 2^=A + 3ft / A G jr ] < 2 |P|exp |- inn| g2 4e ) , 
where B e is defined in (2.10). 



This claim can be deduced similarly as Proposition 1, but one has to choose z such that e 

(nn h )!/4 ■ 



5.3. Proof of Theorem 1 

First, we recall that sup^ g:F ||/||oo < \T\M and set 

:= jVA S A, /a S J>„| and f! c := {aA e A, / A £ Jj„ } . (5.8) 
Then, since /aGJ and ||/*||oo < M, the risk can be bounded by 

%* |/a(x ) - /*(xo)| 9 = E/. |/a(x ) - /*(x )rin +E / .|/ A (ajo) - /* M%< 
< E/.|/a(x ) - f(x )\ q tn + ((1 + \V\)M) q ¥ r (n c ). 
Using Lemma 7, Lemma 8, the last inequality, and simple computations, we obtain 
E r \f x (xo)-f*(x )\ q 



< E f .\f x (x ) - f* M% + ((1 + \r\)M)"2\r\ exp 



nLV(41n 2 n 



987max^max + ^max^n 



2^V(XjBo\ q , , 09 ( q j , 2 v ^(A)Bo' 
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< / A (s ) - /* (x )| -2b h - v A^L ) In + 2 f ' 36 



(P + PI) Wl oxp | - 18l ,yy t ) • (5.9) 

vo )max /v max ~ ^ /max'^max / 
Let us now bound the first term on the right-hand side of the last inequality. To do so, we use
simple computations to obtain



9 



E/- f |/a(x ) - f*(x )\ - 3b h - 2v/ Y^y° | 



{\f\(x )-r(x )\-3b h -^^^>z, n)dz. 



Setting z = 2 =fff- £ in the last inequality, using the definition of B £ , and Corollary 3, we get 



'I 



%* ( |A(*o) - /*(xo)| - 36, - 2 ^y° ] l n 

- y| 1 1 I Jo v ioo +4 £ ; 



<2 (7 |P|(3fe /i+ 2 ^M^) / £ ,-i cxp (___),fe. 



One may then check that for any a, b > and any q > 1 

/ e 9-i e -5+5i d e < 1 + (a + 5) 9/2 Gamma(g), (5.10) 
Jo 



so that 

9 



( | /a (»o) - f*(xo)\ - 3b h - 2 vyW g ° ] tn < 2q \V\ l'Sbh+ 2 %^ BQ I (11.2)3 Gamma(g) 

where Gamma(-) is the usual Gamma function. From (5.9) and the last inequalities, the theorem 
can be deduced. I 



5.4. Proof of Theorem 2 

First, we set for all h € H 



xeA 



(5.11) 



Then, we observe that, since £ J and ||/jJ|oo, ||/*||oo < M and sup^gjr ||/||oo < 1^1 Af, the risk 
can be bounded by 

%• I/a^o) - /* M' = %* \f\( x o) - f(x )\ q lA+E f ,\f x (x ) - f*(xo)\ q l A ° 
< E f ,\f x (x ) - r(x )\ q lA + ((1 + \V\)M)*F f .(A°). 
Using Lemma 9, Lemma 8, the last inequality, and simple computations, we obtain

E f .\f x (x )-r(x )\ q 

< E/» \f x (x ) - f*(x )\ q t A + ((1 + \V\)M) q (2n 2 + P/.(fi c )) 

1 



<2*E f .[ \f~ x (x )- f*(x )\-3b h 



2a n ^/V(y)Bo 
y/nllh 



1a 



2 q 3b h 



2a nV / V(M)B Q * 



+ ((1 + \V\)M) q (2/n 2 + F f , (fO) . (5.12) 



Let us now bound the first term on the right hand side of the last inequality. To do so, we use 
simple computations to obtain 



%• \ffco)-f*(zo)\-Mh 



2a n ^/V(X*)B Q 
VnUh 



1a 



q / (zy-'Pf* \h(xo)-r(x )\-3b h 



2a n ^V{^)B Q 



>z',A]dz'. (5.13) 



On the event A and by definition of a n in (5.6), it yields 

(1 - a n )\/l - a r] 



(1 + a n )y/l + a r , 



V(A) = a;VV(A) (5.14) 



Setting z' = - ari \ // ^ x l £ m (5.13). defining B t := Bq + e, using the definition of B e , the last 



inequality, and Proposition 1 with e = 10\/z - 



2z 



7 -f* \f\( x o) ~ f*( x o)\ - 3b h - 



2a ny /V(\*)B 



1 

1a 

+ 



, we get 



y/nU h 



( \k(xo) - f*(xo)\ > 36 h + 2a Wn^)B £ ^] d£ 



yjnll h 



< 



< 



2a n ^JV(X) 



P/* \h(x )-f*{xo)\>3bh + 



2JV(\)B £ 



VnTLf, 



tt de 



2a ny /VW) 



< 2q\P\ 



y/nll h J 
( 2a nV /\7(V0 



£ q - X ff, SUP 



e q 1 exp 



\h{x )- r{x )\ -3b h - 



de 



>o,n\de 



100 + 4e 



< 2q\P\ 3b h 



< 2q\V\ 3b h + 



2any/vW)Bo 



e q 1 exp 



100 + 4e 



de 



2a n ^/V(X*)B Q 
y/nU h 



(11.2) 9 Gamma(g). 
The last inequality is obtained from (5.10). From (5.12) and the last inequality, the theorem can
be deduced. ■



5.5. Proof of Theorem 3 

For ease of exposition, we set $k := h_{\mathrm{iso}}$ and $\hat k := \hat h_{\mathrm{iso}}$. Then, one may verify that the oracle
bandwidth



k* := arg min < Ldfr 1 ^ + 2 



y / Vmax(;Q*,ir*)(-go + iso e (n)) 
dVnk 3 



is well defined. Moreover, let us introduce the element $k^*_\varepsilon$ of the net $\mathcal{H}^{\mathrm{iso}}_\varepsilon$ such that $k^*_\varepsilon \le k^* \le \varepsilon^{-1}k^*_\varepsilon$.
Furthermore, from Condition 3 on $n$ and Lemmas 7 and 9 with $h = (k, \ldots, k)$, it follows that



V (3k G U?° : /*, i T 5n ) < 2\V\ J2 ex P 



nhi-J(Aln 2 n) 



< 217V 1 (5-15) 



and 



Pf* (a c ) < J2 

ken 1 *" 



fce« 
<4|7V\ 



2\V\ CX P 



fee-Hi 



n< in /(41n 2 n) 



(5.16) 



where h nlin and A are defined in (2.3) and (5.11), respectively. Thus, we may restrict our consid- 
erations in the following to the event f\-e« is ° 

Control of the risk on the event $\{k^*_\varepsilon \le \hat k\}$. With the triangle inequality and Lemma 8, we
obtain



\ftM-r(x )\H 



k'<k 



/tao(*o)-/£(»o)|%<t +2^ 1 E / ,|i' fc i(xo)-r( a; o)| 9 . (5.17) 



The first term on the right hand side of the last inequality is controlled by the construction of the 
procedure (3.7), and thus 



/£o(*o)-/£(*o)ri 



k;<k 



<E f , 



20 



V max (p,/O(S +iso e (n)) 



\/n(k* 



On the event H/tp^/ 1 ™ ^> we S e t similarly as in (5.14) 



|/L(zo)-&o)rw 



C 20 



V / T + ^ VVmax(p*,^*)(S + iso £ (n)) 



1 - a n 



y/n{kt)° 



where a n is defined above (5.6). Recall that, by the definitions of the Holder classes (Definition 4), 
we can control the bias for any j3 G]0, b] and any k > by 



b k < sup |P(/*)(z - aro) - /* < id^, 



(5.18) 
where $P(f^*)(x - x_0)$ is the Taylor polynomial of $f^*$ at $x_0$. So we can finally deduce from Theorem
2 with $h = (k^*_\varepsilon, \ldots, k^*_\varepsilon)$ and $b_h = b_{k^*_\varepsilon}$ a bound for the second term in (5.17):



/3 y/V max.jp*, K*) Bp 



ojl/n) 



Using (5.17) and the last two inequalities, and invoking Condition 2 in Section 5.1, we have a
control of the risk on the event $\{k^*_\varepsilon \le \hat k\}$:



Ef. 



ftM-f'jx )\ q l k . 



k*<k 



< [A0 q + 2C q ] Ldjk' t 



y 3 | VV m ax( j 0*,A'*)(g +isO e (n)) 

v^jktr 



+ 0(1/77). (5.19) 



Control of the risk on the event $\{k^*_\varepsilon > \hat k\}$. In order to control the risk on the complementary
event, we observe that



E f . 



ftjx ) - f*jx )\H k , >k ] < ((1 + \V\)M)*P f .{kt > k). 



(5.20) 



We now show that the probability $P_{f^*}(k^*_\varepsilon > \hat k)$ is small. According to the construction of the
procedure (3.7), we have



V(fc e * > k) < Ff, I 3k' eH,k>< k* : f^jx ) - g o jx ) 



> 20 



V max (/5,A')(B +iso e (r7)) 



f-!;M-rjx ) 



< 2 Yl p f 

k'eHf" : k'<k* 

On the event C\ke-H is ° ^> we S et similarly as in (5.14) 



> 



20 V V 

max 

jp,K)jB + iso e jn)) 



jk* e >k)<2 



f* 



fLjxo)-f*jx ) 



> 10 



\fi - a n v / Vmax(/3, K)jB + iso e (n)) 



k'£H?° :k'<k* 



1 + a„ 



^njk') d 



where $\tilde a_n$ is defined above (5.6). According to Condition 2 in Section 5.1, we have $\tilde a_n \le 1/3$.
Consequently,



,jk* £ >k)<2 Y 



f* 



f? o jx )-f*jx ) 



> 5 



fc'GWi so : fe'<fe* 



^/V ma .Ap,K)jB + iso e (n)) 



(5.21) 



By definition, the oracle bandwidth k* is the one which gives the best trade-off, so that for all 

k' < k* < k* 



Ldjk* f < Ldjk* f = VV^xjP* > K *)j B o + is °c( n )) < V v max(p*, K*)jB Q + iso e (n)) 



\/njk*y 



yjnjk*) c 



< 



< 



y/Vmax(/5*, A'*) (.Bp + iso e (n)) 
^fnjk') d 



y / Vmax(/3, K)jB + iso e (n)) 

^jk~r 
From (5.18), (5.21) and the last inequality, we get

>k) 



23 



jr/. 



< 2 



E 

ri a o . fc 

E 



k'£HK s ° : k'<k» 



*/. sup 

\p.A- 



/L(^o)-r(xo) 

/ta>0=o) -/*(»<>) 



2 VVmax(p,^)(gQ+iS0 e (n)) + ^ 



- 2 



^n(fc') d 

VVmax(j9,i^)(gQ+iS0 e (n)) 



> 36* 



Using $\mathrm{iso}_\varepsilon(n)/(n(k')^d)^{1/4} \le 1$ (see Condition 3 on $n$), the definition of $\mathrm{iso}_\varepsilon(n)$, Proposition 1 with
$h = (k', \ldots, k')$, $V(\lambda) = V_{\max}(\rho, K)$, and $z$ such that $B_z = B_0 + \mathrm{iso}_\varepsilon(n)$, we obtain



k'enf" ■. k'<k: 



(iso e (n)) 5 



100 + 4iso e (n)/(n(fc') d ) 1/4 
Then, in view of the last inequality, (5.15), (5.16), (5.19), and (5.20), we conclude



< 417V 1 - 



%* |/\*o) - /* M ? < 2- 1 [40< + 2C q ] Ld( K f + VV^^^+iso^ + ^ 



By the definition of $k^*$ and $k^*_\varepsilon$ at the beginning of the proof, the theorem is proved.



5.6. Proof of Theorem 4 

One may verify that the oracle bandwidth 

,* • f rWiq ^ , o V /V(p*,A-*)(i3 + ani £ (n)) 
ft := argmm < L > p„ (hj) ' 3 + 2 

is well defined. Moreover, define the element h* of T-L e such that for all j = 1, . . . , d, /i* ■ < h* < 

t' Y h\y We then note that the estimator Z' 1 is a constant function and /° = f*(xo) since we only 
consider locally constant functions (\P\ = 1). To stress the importance of the bandwidth, we set 
for any h € H 

V h (.) := D- Xh (.) = n 1 Y,p'{y i - -)&h{Xi) 

i 

and 

V h (-) := P [D- Xh (-)} = J K h (x) J p'(az + f*(x) - ■)G(z)dzdx. (5.22) 

Here, X h (x,y,f) := p(y - f(x))K h (x), G(-) := .%(')> and and D x {-) are defined 

in (3.12) and (5.3), respectively. Next, for uniform designs and homoscedastic noise levels, the 
quantity c\ h simplifies for any A^ to 

c\ h = c p := / p" (az)G(z)dz. (5.23) 
Moreover, according to Lemma 10, we have for any two constant functions $f, \tilde f \in \mathcal{F}_{\delta_n}$



1/ - /I < (1 + 2y/b haiBX +S n ) mfcf\V h (f) - V h {f)\. 

hen 

Furthermore, from Condition 3 on n, Lemma 7, and Lemma 9, it follows that 

nUh/(4: In 2 n) 



3heH e : f h ^T s ) <2 ^ 



exp 



h£-H, 



< In 1 



and 



n/ijL iT ,/(41n n) 



< 4n"\ 



(5.24) 



(5.25) 



(5.26) 



where A is defined in (5.11). Thus, we restrict our considerations in the following to the event 
j/' 1 £ Ts n for all h £ H e |n A. Moreover, we work on the event A := {/i* ^ /i} and its complement 

A c separately. For this, we decompose the risk into R A (f h , f*^j '■= E/* \f h (%o) — f* (%o) 

and R M (f h ,f*) := E/» f|/ h (s ) - /* (*o) \ q t{A c } 



Control of the risk on the event $A$. With the triangle inequality and Lemma 8, we obtain



ra (f k ,r) 



< y- 1 



RA[f h: ' h J h )+RA[f h ' K J h: ) +RA(f K ,f 



(5.27) 



Let us now control the first term on the right hand side of the last inequality. First, we observe 
that 

RA(f K ' h J h ) <E r sup \f h *' h (x ) - f h (x )\ q l A . (5.28) 



hen : h>~h* 



To simplify the presentation, we introduce the notation t„ := (1 + 2^/6/i max + S n ). Using (5.24) 
and taking / = f h f h and / = / , we then have 

\f K ' h (x ) - f h (x )\ < r n cf V h (f h <> h ) - V h (f h ) 

Recall that, by definition, $\hat D_h(\hat f^{h}) = 0$ for all $h \in \mathcal{H}$. We then obtain from the last inequality for
any $h \in \mathcal{H}$



\f K - h (x ) - f h (x )\ < T n cf (\v h (f h *' h ) - V ht vh{f K ' h ) 

+ \D Kvh (f h *< h ) - V K vh(f h "" h )\ + \Vh{f h ) - MP 
Using the last inequality and (5.28), we have 

RA(f h>K J~ H ) <2^ l T^ f ,c? sup sup \V h (f)-V Kvh (f)\ q 



(5.29) 



h€H c feFs, 



2H« Y, E /* c 7 SU P VhW-Vhif) 



Using Lemma 11 and Lemma 12 with h! = h* and p = p, we get 

d 



RA(f h ' K J h ) <2*"V^ 

j'=i 



Pi 



2*t*C 9 



n 



ny /V{p*,K*)(B + anie(n)) 
According to Conditions 1 and 2 in Section 5.1, we have $\tau_n \le 2$ and $a_n \le 2\sqrt{2}$. Thus,

W x y/V(p*,K*)(B Q + 8^(71)) ^ 



R A {f h ' K J h )<2 4 ' 1 C q [Lj2(K, j y 

3 = 1 



(5.30) 



The second term on the right hand side of (5.27) is controlled by the construction of the procedure 
(3.14), thus 



Ra[I KK J K ) 



16 



V(p, K){B + &ni e (n)) 



1 A . 



On the event A, 



RA[f KK ,f K ) < I 16 



.VTTaTt ^V{p*,K*)(B + ani e (n)) 



1 - a n 



(5.31) 



where $\tilde a_n$ is defined above (5.6). By the definition of the Hölder class (Definition 4) and of $b_h$
(Definition (2.9)), we can control the bias for any $\beta$ with $\lfloor\beta\rfloor \le b$ and for any $h \in \mathcal{H}$:

d 

b h < sup \P(f*)(x-x )-r(x)\<Lj2hf, 

xev h . =1 

where $P(f^*)(x - x_0)$ is the Taylor polynomial of $f^*$ at $x_0$. Finally, with Theorem 2, we can bound
the third term in (5.27):



RA(f h ',r)<2c q [l^k.y 



y/V( P *,K*)B 



■o(l/n) 



Using (5.27), (5.30), (5.31), and the last inequality, and invoking Condition 2 in Section 5.1, we
have a control of the risk on the event $A$:



RA{f h J\ 
< 3 9 - 1 [2 iq C q +32« 



/n). (5.32) 



Control of the risk on the event $A^c$. In order to control the risk on the complementary event
$A^c$, we observe that



Ra* (f k J*) < {2MYF f .(A c ). 



(5.33) 



We now show that the probability $P_{f^*}(A^c)$ is small. According to the construction of the procedure
(3.14), the event $A^c$ implies that there exists an $h' \in \mathcal{H}_\varepsilon$ such that $h' \preceq h^*_\varepsilon$ and



f h: ' h '(x )-F(x ) 



h' 1 



> 16 



V(p, K){B + ani e (n)) 



\ATT 



Using (5.24) and taking / = f h c h and / = f h , we have on the event A c 

Tn4 V h ,(f h *' h ')-V h ,(f h ') >16 



V(p,k)(B + & m e (n)) 
From the last inequality, we obtain (cf. (5.29))



r n cf sup \V h ,(f)-V h , yh ,(f)\+2T n cf sup V h ,{f)-V h ,{f) 



Together with Lemma 12, this yields 

d 

tIL^KjY 3 +2r n cf sup V h ,{f)-V h ,(f) 

j=l f^n 

On the event Clhen ^> we S et similarly as in (5.14) 



> 16 



V(p, A)(-B + anL») 



> 16 



V(p,K)(B + am e {n)) 



3 = 1 



> 16- 



VI - a„ V V(p, K )(B + ani e (n)) 



1 + a,, ; 



where $\tilde a_n$ is defined above (5.6). According to Conditions 1 and 2 in Section 5.1, we have $\tau_n \le 2$
and $\tilde a_n \le 1/3$. Consequently,



c- sup 



V h ,(f)-V h ,(f) 



l 6v /V(p,^)(So + ani e (n)) <* 



By definition, the oracle bandwidth /i* is the one which gives the best trade-off, then for all 



L^Tth*)^ < L^(h* = VV(P*' K *)( B " + ^Mn)) < y/V(p*,A*)(i? + anL>)) 



3 = 1 



3 = 1 



< 



< 



V(p,K)(B + am e (n)) 



y/nllh' 



From the last two inequalities, we obtain on the event $A^c$



c'p sup 



> 



V(p,K)(B + ani e (n)) 



\/nTl h , 



V h ,(f)-V h ,(f) 
Then, we have a control of the following probability 

f> h '(f)-v,Af) 



h'£H € : h'<hl 



P,K feFg n C p yjY(p, K) 



> 



B + ani e (n ) 



Using $\mathrm{ani}_\varepsilon(n)/(n\Pi_{h'})^{1/4} \le 1$ (see Condition 3 on $n$) and Lemma 6 with $z$ such that $B_z = B_0 + \mathrm{ani}_\varepsilon(n)$,
we deduce that



v(-4 c )< E ex p 



h'eH e : h'<h* 



(ani e (n)) 5 



100 + 4ani e (n)/(nIL l ,) 1 / 4 



< n 
From (5.33) and the last inequality, we obtain a control of the risk on the event $A^c$:

Ra< (?'\.r) < (2M)%"\ 
Then, in view of the last inequality, (5.25), (5.26), and (5.32), we conclude that 

E f .\f h (x )-r(x )\ q 



< 3- 1 [2^C q + 32< + 2C q ] ( L J>:,^ + ] + o(l/n) 

With the definition of $h^*$ and $h^*_\varepsilon$ at the beginning of the proof, the theorem can be deduced.



6. Appendix 

6.1. A Entropy Calculations 

First, let us give a bound for the entropy Hjr v (defined below (2.10)) and its Dudley's integral. 
For this, we recall that the metric entropy of a set is the logarithm of the minimal number of balls 
(with respect to the corresponding metric) needed to cover the set (see, for example, [31]). For any 
v G]0, 1], we then have 

Hr,M < \P\ 1» f^r) and f \/ H ^A u ) A ndu ^ V[P\ H 2M ) + VW\ j VHv)/(2v 3 )dv. 

We now give a bound for the entropy ff/uA,u, that is, (defined below (2.14)) for the special set 
of Huber contrasts indexed by the scale 7,Th := {p = pn,-y ■ 7 € [7min, 7max]} • Here, the positive 
constants are chosen such that 7 m ; n < 1 and 7 max > 1. In this example, we do not consider the 
choice of the kernel, we just take the indicator function as kernel and K, = {l[-i/2.i/2] d (')}- In this 
case, we have A = Th- For v €]0, 1], we finally give the following bound 

#fut=, w (») < (1 + W\) In [ 16 [12 V 9oo ] - [2M V ^hlA 



4 2 
'mm 



and 



^/uTh^W A ndu <y(l + \V\) In (l6[l2 V 9oo ] [2M V 7 



7max 
max I 4 

/min 



+ y/l + \V\ J ^Hv)/(2v 3 )dv. 

Here, := sup i=1 n ||fifi||oo where (g,), are the noise densities in the model (2.1). 
6.2. B Proof of the Auxiliary Result

Proof of Proposition 1. The definitions of $\hat f_\lambda$ and $f^0$ (see (2.5) and (5.1), respectively) imply
that

|A(zo)-/*M = I(*a) .„o — *8 ol < II*a-*°IL- 



28 Chichignoud & Lederer 

Using fx £ J 7 s„ , Lemma 4, and the last inequality, we have 

l/A^oj-r^oji^a-^)- 1 ^ 1 !!^^^)]-^^/ )]!!^. 

Recall that by definition $D_\lambda(\hat f_\lambda) = 0$ and $P^0[D_\lambda(f^0)] = 0$. Thus, for all $\lambda \in \Lambda$ such that $\hat f_\lambda \in \mathcal{F}_{\delta_n}$,
the last inequality implies

\h(xo)-nXo)\ < (l-^)- 1 CX 1 (||^(/A)-i , [^A(A)]||^ + ||PpA(/A)] -^[^(A)]!!^) 

From Lemma 5 and the last display, we obtain 

| Afro) - /*(*o)| < (1 - vC)" 1 ^ 1 (||£>a(/a) - P[D x (fx)] Wioo + (1 + V&fc + Sn)cxb h ) 

< ^y ^+ti-^)- 1 sup ^ii^^-p^a)]!!^. 

t - V On /GJ r 5„.AeA 



As v5/T+^n < 1/2 according to Condition 1, this yields 

\h(x ) - f*(x )\ < 3b h + 2 sup d£\\D x (f) - P[Dx(f)] \\e^ 

/e^5„,AeA 

From the last inequality and the definitions of V(-) and c\ introduced in (2.6) and (5.7), respectively, 
we deduce 



|/A(x )-r(xo)|-2 



>3bAn fl {Ae^} 

J AeA 



< 



< 



sup 

,/6^5„,AeA 



sup 



2c- a 1 ||^a(/)-P[^a(/)] 11^-2^^^ 



> 



\D x (f) - P[D x (f)]\\ ia 



> 



D- 



/^ s „,aga ^^/ P [A'(/*)] 2 + A^/(nn,)i/4 
Using Lemma 6 and the last inequality, we finally obtain 



V AeA 



|A(a!o)-/*(xo)| -2 



>3b h \n 



fl {/a6^} <2in 

AeA / 



6.3. C Technical Lemmas 

We first give a result for the deterministic criterion $P^0[D_\lambda(\cdot)]$ defined in (5.4): 



Lemma 4. For any $\lambda \in \Lambda$ and any $h \in \mathcal H$, and for $n$ sufficiently large (see Condition 2 in Section 
5.1), the following holds: 

1. $P^0[D_\lambda(f^0)] = 0$, and the function $P^0[D_\lambda(\cdot)]$ is bijective as a function of $\mathcal F_{\delta_n}$ (see Definition (5.2)) on the corresponding image. 
2. For any $f, \tilde f \in \mathcal F_{\delta_n}$, 

$$\|t - \tilde t\|_{\ell_\infty} \le (1-\sqrt{\delta_n})^{-1} c_\lambda^{-1} \|P^0[D_\lambda(f)] - P^0[D_\lambda(\tilde f)]\|_{\ell_\infty},$$

where $f = P_t$ and $\tilde f = P_{\tilde t}$. 

Next, we consider the bias: 

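Lemma 5. For any $h \in \mathcal H$ and any $\lambda \in \Lambda$, if $n$ is sufficiently large (see Condition 2 in Section 
5.1), it holds that 

$$\sup_{f \in \mathcal F_{\delta_n}} \|P^0[D_\lambda(f)] - P[D_\lambda(f)]\|_{\ell_\infty} \le (1+\sqrt{b_h+\delta_n})\, c_\lambda b_h.$$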

The following lemma allows us to control the deviations of the process $D_\lambda(\cdot)$: 
Lemma 6. For any $h \in \mathcal H$, it holds that 



f* 



\\D x (f) - P[D x (f)]\\ eoo B z , 

sup > -== | < 2\V\cxp{-z). 



Now, we bound the probability of the event "the $\lambda$-LPA estimator does not belong to the ball 
centered at $t^0$ with radius $\delta_n$": 

Lemma 7. For any h £ %, if n is sufficiently large (according to Condition 1 in Section 5.1), it 
holds that 

nrV(41n 2 n) 



V (fi c )<2|P|exp 

where $\Omega^c := \{\exists \lambda \in \Lambda : \hat f_\lambda \notin \mathcal F_{\delta_n}\}$, and $\gamma_{\max}$ and $\mathcal K_{\max}$ are defined in Section 2. 

Next, we do some simple algebra. 
Lemma 8. For any $x, y \in \mathbb R_+$, it holds that 

$$x^q \le 2^q [x-y]_+^q + 2^q y^q.$$

Moreover, for any $l, q \in \mathbb N^*$ and $x_1, \dots, x_l \ge 0$, it holds that 

$$\Big(\sum_{i=1}^{l} x_i\Big)^q \le l^{q-1} \sum_{i=1}^{l} x_i^q.$$


The following lemma allows us to get our hands on the estimator V(-). 

Lemma 9. For any h £Ji, if n is sufficiently large according to Condition 2 in Section 5.1, it 
holds that 

P/. (A) > 1 - 2/n 2 - PHin 



^ereA^n^Ajv^e VW), VW) 

is defined in (5.6). 



f2 c is defined in Lemma 7, and a n 
We now consider functions near the target $f^*$. 

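Lemma 10. Let $\mathcal D_h(\cdot) : [-M, M] \to \mathbb R$ and $c_\rho$ be as defined in the proof of Theorem 4 and assume 
$f^* \in \mathbb H_d(\beta, L, M)$ and $n$ sufficiently large (according to Condition 1 in Section 5.1). Then, for any 
$t, \tilde t \in [f^*(x_0) - \delta_n, f^*(x_0) + \delta_n]$, it holds that 

$$|t - \tilde t| \le (1 + 2\sqrt{b_{h_{\max}} + \delta_n}) \inf_{h \in \mathcal H} c_\rho^{-1} |\mathcal D_h(t) - \mathcal D_h(\tilde t)|.$$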

Next, we control the distance of $\hat{\mathcal D}_h(f)$ to $\mathcal D_h(f)$ for appropriate bandwidths $h$ and functions $f$: 

Lemma 11. For n sufficiently large (see Condition 3 in Section 5.1 ), it holds for any h G H that 

a n ^V(p*,K*)(B a + &m £ (n)) \ q 
VnTh ) 

for a constant C q (C q = Aq2A q Gamma(g) works). The functionals T> and T> are defined in the 
proof of Theorem 4, Gamma(q) is the classical Gamma function, V(p* , K*) is defined in (3.9) and 
(3.10), ani e (n) is defined in Section (3.2), and a n is defined in (5.6). 

Finally, we look at the distance of $\mathcal D_{h' \vee h}(f)$ to $\mathcal D_h(f)$ for appropriate bandwidths $h$ and $h'$ 
and functions $f$: 

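Lemma 12. For any $h' \in \mathcal H$, any $f^* \in \mathbb H_d(\beta, L, M)$ such that $\beta \in ]0, 1]^d$, and for $n$ sufficiently 
large (according to Condition 2 in Section 5.1), it holds that 

$$\sup_{h \in \mathcal H} \sup_{f \in \mathcal F_{\delta_n}} |\mathcal D_{h' \vee h}(f) - \mathcal D_h(f)| \le (1 + \sqrt{\delta_n + b_{h_{\max}}})\, c_\rho L \sum_{j=1}^{d} (h'_j)^{\beta_j},$$

where $\mathcal D_h$ and $c_\rho$ are defined in (5.22) and (5.23) in the proof of Theorem 4. 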
6.4. D Proofs of the Technical Lemmas 

Proof of Lemma 4. Let us prove the first claim. For this, we note that the components of 
$P^0[D_\lambda(f)]$ are given by 

P a [D{(f)} = f (^p-X fi(x)K h (x) f p'(a(x)z + f(x)-f(x)) ±£ gi {z)dzdx. 

^ ' i—l 

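Since $\rho$ and $\sum_i g_i(\cdot)$ are symmetric, it holds that $\int \rho'(z) \sum_i g_i(z)\,dz = 0$ and $P^0[D_\lambda(f^0)] = 0$. 
We now show that $P^0[D_\lambda(\cdot)]$ is injective on the image of $\mathcal F_{\delta_n}$, exploiting further the symmetry of 
$\rho(\cdot)$ and $\sum_i g_i(\cdot)$. Consider $f, \tilde f \in \mathcal F_{\delta_n}$ such that $P^0[D_\lambda(f)] = P^0[D_\lambda(\tilde f)]$. We have to show that 
$f = \tilde f$. For this, we first note that 

$$\sum_{p \in \mathcal P} (t_p - \tilde t_p)\big(P^0[D^p_\lambda(P_t)] - P^0[D^p_\lambda(P_{\tilde t})]\big) = 0,$$

where $t$ and $\tilde t$ are such that $P_t = f$ and $P_{\tilde t} = \tilde f$. To simplify the presentation, we introduce the 
notation $u(\cdot) := (f - f^0)(\cdot)$, $\tilde u(\cdot) := (\tilde f - f^0)(\cdot)$, and $G(\cdot) := n^{-1} \sum_{i=1}^{n} g_i(\cdot)$. Since $G$ is symmetric, 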



E 



sup 

f^s n 



V h {f)-V h (f) 



< 



n\H f 
$K$ is nonnegative, and $\rho'$ is odd and positive on $\mathbb R_+^*$, the last display implies 

Kh(x)fi(x)[u(x) — u(x)] J [p'(a(x)z — u(x)) — p'(cr(x)z — <G(z) dz dx = 

<=> / Kf l (x)/i(x) \u(x) — u(x) | / |p'((r(a;)2: — — p' (a{x)z — u(x)) | <G(z) dz dx = 0. 



As $f, \tilde f \in \mathcal F_{\delta_n}$, it holds that $\sup_{x \in V_h} |u(x)| \vee |\tilde u(x)| \le \delta_n$. Moreover, using the mean value theorem, 
the P-continuity of p" , Assumption (2.2), inf z g[_ 7mini7min ] p"(z) > and \f8^ < § A ^ m '° , we obtain 



> 



> 



> 



A"ft,(x)/i(a;)|w(a;) — / |/9'(cr(x)^ — — p'(cr(x)z — u{xj) | G(z) rfz 

/A/i(x)/z(x)|w(a;) — u(a;)| 2 inf / p" (a{x)z — s) &(z)dzdx 
/ Kh(x)fj,(x)\u(x) — u(x)\ 2 inf / p" (a(x)z — s) G(z)dzdx 

J s:\a\<S n J 

Kh(x)fi(x)\u(x) — u(x)\ 2 dx / p" (a{x)z) G(z)dz — 8 n L p ii 



> 



/ K h (x)p(x)\u(x)-u(x)\ 2 dx, 



where $A, \gamma_{\min} > 0$ are introduced in Assumption (2.2) and $\rho''_{\min}$ in Definition 1. The last display 
and the positivity of $K$ over its support yield $\sup_{x \in V_h} |u(x) - \tilde u(x)| = 0$. As $u$ and $\tilde u$ are polynomials 
of finite degree, we finally obtain that $f = \tilde f$, and the first claim is proved. 

Let us now turn to the second claim. We set $D(\cdot) := P^0[D_\lambda(\cdot)]$ and note that $D(\cdot)$ is differentiable 
and injective on $\mathcal F_{\delta_n}$ (the latter according to the first claim). We can consequently find an 
inverse of the function $D$ on the image of $D$ on $\mathcal F_{\delta_n}$. We then obtain, denoting the matrix $\ell_\infty$-norm 
by $|||\cdot|||_\infty$ and the inverse of $D$ by $D^{-1}$, for all $f \in \mathcal F_{\delta_n}$ 



\J D -A!)\\V 



iV(/)iiu = \Vb{j)\\\-^ < [j D {f)]oi = [ px "(f)] 1 ^ a - ^n)- l c- x \ 



The constant $c_\lambda$ is defined in (5.7) and the last inequality is obtained by the P-continuity of $\rho''$ and 
the condition on $\delta_n$. The mean value theorem and the last inequality then imply for any $f, \tilde f \in \mathcal F_{\delta_n}$ 
and the associated coefficients $t$ and $\tilde t$ 



\\t-t\t 



D- 1 o D{f) - D- 1 o D(f) 



<(1 



"W 1 



D(f) - D(f) 



This proves the second claim. 



Proof of Lemma 5. By the definitions of $P[D_\lambda(\cdot)]$ and $P^0[D_\lambda(\cdot)]$ in (5.4), we have for any 
$f \in \mathcal F_{\delta_n}$, any $\lambda \in \Lambda$, and any $p \in \mathcal P$ 

$$|P^0[D^p_\lambda(f)] - P[D^p_\lambda(f)]| \le \int \mu(x) K_h(x) \int |\rho'(\sigma(x)z + f^0(x) - f(x)) - \rho'(\sigma(x)z + f^*(x) - f(x))|\, G(z)\,dz\,dx. \quad (6.1)$$

It additionally holds for all $f \in \mathcal F_{\delta_n}$ that $\sup_{x \in V_h} |f^0(x) - f(x)| \le \delta_n$. Together with the definition 
of $f^0$ in (5.1), this implies for any $f \in \mathcal F_{\delta_n}$ 

$$\sup_{x \in V_h} |f^*(x) - f(x)| \le \sup_{x \in V_h} |f^*(x) - f^0(x)| + \sup_{x \in V_h} |f^0(x) - f(x)| \le b_h + \delta_n.$$

This implies, since $\rho''$ exists and is continuous with respect to the measure P (see Definition 1) 
and due to the mean value theorem, that for all $h \in \mathcal H$, all $\lambda \in \Lambda$ and for all $x \in V_h$ there is a 
$u_x \in \mathbb R$ with $|u_x| \le b_h + \delta_n$ such that 

$$|\rho'(\sigma(x)z + f^0(x) - f(x)) - \rho'(\sigma(x)z + f^*(x) - f(x))| \le |f^*(x) - f^0(x)|\, \big(\rho''(\sigma(x)z) + 2L_{\rho''}(b_h + \delta_n)\big).$$



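Using $\sqrt{b_h + \delta_n} \le A\rho''_{\min}/(2L_{\rho''})$, (6.1), the last inequality, and the definitions of $A$, $b_h$, and $c_\lambda$ 
in (2.2), (2.9) and (5.7), respectively, we obtain for any $\lambda \in \Lambda$ 

$$\sup_{f \in \mathcal F_{\delta_n}} \|P^0[D_\lambda(f)] - P[D_\lambda(f)]\|_{\ell_\infty} \le \int \mu(x) K_h(x) |f^*(x) - f^0(x)| \int \big(\rho''(\sigma(x)z) + A\rho''_{\min}\sqrt{b_h + \delta_n}\big) G(z)\,dz\,dx$$

$$\le (1 + \sqrt{b_h + \delta_n})\, c_\lambda b_h.$$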



Proof of Lemma 6. In this proof, we use a special case of a deviation inequality derived in [22, 
Corollary 6.9]. Adapted to our needs, this deviation inequality reads as follows: 

Massart's Deviation Inequality: Let $X_1, \dots, X_n$ be independent, real valued random variables 
defined on a probability space $(\Omega, \mathcal A, \mathbb P)$. Define $S_n(\pi) := \sum_{i=1}^{n} (\pi(X_i) - \mathbb E[\pi(X_i)])$ for a set of 
integrable, real valued functions $\pi \in \Pi$. If for some positive constants $\sigma$ and $b$ 

$$\sup_{\pi \in \Pi} n^{-1} \sum_{i=1}^{n} \mathbb E[\pi^2(X_i)] \le \sigma^2 \quad \text{and} \quad \sup_{\pi \in \Pi} \|\pi(\cdot)\|_\infty \le b, \quad (6.2)$$

it holds for any $\epsilon \in (0, 1]$ and all $z > 0$ 

$$\mathbb P\Big( \sup_{\pi \in \Pi} S_n(\pi) \ge E + 7\sigma\sqrt{2nz} + 2bz \Big) \le \exp(-z), \quad (6.3)$$

where 



E := 27V^ / \/H(u) A ndu + 2(6 + a)H(a), 
Jo 

and $H(\cdot)$ is the (P)-entropy with bracketing of $\Pi$. 
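
To illustrate how the deviation threshold in (6.3) scales, the following minimal sketch evaluates $E + 7\sigma\sqrt{2nz} + 2bz$ for given inputs and converts a target tail level into the corresponding $z$. The entropy term $E$ is treated as a user-supplied number, and the function names are illustrative assumptions rather than part of the paper.

```python
import math


def deviation_threshold(E: float, sigma: float, b: float, n: int, z: float) -> float:
    """Threshold E + 7*sigma*sqrt(2*n*z) + 2*b*z appearing in (6.3).
    E is assumed to be computed elsewhere from the entropy integral."""
    if sigma <= 0 or b <= 0 or n <= 0 or z <= 0:
        raise ValueError("sigma, b, n, and z must be positive")
    return E + 7.0 * sigma * math.sqrt(2.0 * n * z) + 2.0 * b * z


def level_to_z(alpha: float) -> float:
    """Value of z for which the tail bound exp(-z) equals alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return math.log(1.0 / alpha)
```

For fixed $\sigma$ and $b$, the stochastic part of the threshold grows like $\sqrt{n}$, so after normalizing by $n$ it vanishes at the usual parametric rate.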

Recall that the distance w(-,-) is defined in (2.15). We now apply Massart's Inequality (6.3) 
with 



U h JP[X'(f*)] z + \' 00 {nU h )-^ 
($P[\lambda'(f^*)]^2$ is given in (2.7)) and 

a = l, b= [ % , H{-) = Hjr. uAtU (.), and S n (ir) = — 
u h Jp[\'(f*)Y + xUnn h y^ 



to obtain 



and 



ua(u)) A ndu + 2 



+ 1 flj- s „ua, w (1), 



?/. sup 



|J A (/)-P[D A (/)]|k 



< 



E 



/e^„,AeA [ A / (r)] 2 + A ^ /(n n,)V4 ^VTh 

Th\D p x (f)-P[D{(f)}\ 



E / 22 

> — = + 7 



sup 



nn h (niL,) 3 / 4 

> £ + 7(rV2nz + 2bz 



< 2\V\cxp(-z). 

Note that the factor 2 in the last inequality appears because we need to control deviations of the 
absolute value of the empirical process. The claim is now deduced with simple calculations from 
the last display noticing that $B_z \ge E/\sqrt{n} + 7\sqrt{2z} + 2z(n\Pi_h)^{-1/4}$. 



Proof of Lemma 7. First, we show that $f^0$ is the unique solution of the equality $P^0[D_\lambda(f)] = 0$ 
on $\mathcal F$. For this, we consider $f \in \mathcal F$ such that $P^0[D_\lambda(f)] = 0$. We then observe that 

$$\sum_{p \in \mathcal P} (t^0_p - t_p)\, P^0[D^p_\lambda(f)] = 0,$$

where $t$ is such that $P_t = f$ and $t^0$ is defined in (5.1). Since $G(\cdot) := n^{-1} \sum_{i=1}^{n} g_i(\cdot)$ is symmetric, $K$ 
and $\mu$ are nonnegative, and $\rho'$ is odd, the last equality implies 



K h {x)n{x)[f{x) - f{x)\ J p'{a{x)z + f°(x) - f(x)) G(z) dz dx = 
«• J K h (x)p(x)\f(x) - f(x)\ J p'(a(x)z + \f°(x) - f(x)\) G(z) dz dx = 
O K^:r)M(z)|/ (z) - f(x)\ J P '(a(x)z + \f°(x) - f(x)\) G(z) dz = Q for all x e V h . 
Thus, if / ^ /°, there exists an open, nonempty set V C Vh such that 

sup f p'(a(x)z + \f(x) - f(x)\) G(z) dz = 0, 



since / and / are continuous. Recall that for any x, J p' (a(x)z)G(z)dz = 0. Since G(z) is a density 
and therefore not translation invariant and since J p" (a(x)z)G(z)dz > 0, this yields 

sup f p'(a(x)z + \f°(x) - f(x)\) G(z) dz = sup - f(x)\ = 0. 



(6.4) 
This contradicts $f \ne f^0$ because $f$ and $f^0$ are polynomials of finite degree. In other words, $f^0$ is 
the unique solution of $P^0[D_\lambda(f)] = 0$ on $\mathcal F$. 

We now look at the event $\{\hat f_\lambda \in \mathcal F_{\delta_n}$ for all $\lambda \in \Lambda\}$. To this end, we recall that $\hat f_\lambda$ is the solution 
of the equation $D_\lambda(\cdot) = 0$ and the following inclusions hold: 

{3X : A i F 5ri } 

"up \\b x {f)-P°[b x {f)}\\ > inf \\P [Dx(f)]\\A 

Q | sup \\D x (f)-P°[D x (f)]\\ > inf ||P°[£a(/)]||,1. 
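Next, for any $\lambda \in \Lambda$, it holds that 

$$\|D_\lambda(f) - P^0[D_\lambda(f)]\|_{\ell_1} \le \|D_\lambda(f) - P[D_\lambda(f)]\|_{\ell_1} + \|P[D_\lambda(f)] - P^0[D_\lambda(f)]\|_{\ell_1}$$

$$\le |\mathcal P|\, \|D_\lambda(f) - P[D_\lambda(f)]\|_{\ell_\infty} + |\mathcal P|\, \|P[D_\lambda(f)] - P^0[D_\lambda(f)]\|_{\ell_\infty}. \quad (6.5)$$

We then set $\vartheta_h := \int \mu(x) K_h(x)\,dx$ and use the continuity of $\rho'$ to derive, similarly as in Lemma 
5, for any $\lambda \in \Lambda$ 

$$\sup_{f \in \mathcal F} \|P[D_\lambda(f)] - P^0[D_\lambda(f)]\|_{\ell_\infty} \le \vartheta_h b_h. \quad (6.6)$$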
To control the stochastic term, we can then apply Massart's Inequality (6.3) with tt = /, 
* = 1, ^=4p> ff(')=^uA, u O, and g ra (7r)= n ^ fe(/)-P [£%(/)]) 

V lift 7max/^max > ' 

to obtain ^ 
and 



F f , I sup \\D x (f) - P[D x (f > ;=^£ + 77 max /C max W— — + — 

\fer,\eA 00 "Vll/! V nIL h nIL h 

<$>/• ( SU P |^(/)-P[^(/)]|>g + 7aV2n^ + 2fe) 



< 2|P|exp(-z). 



Setting £ := -E' + 77 max /C max y ^f^- + 2 „ff^ an d E' := 7 '"'™^ [ max _E, we can rewrite the last inequality 
to get 

Using (6.5), (6.6), and the last inequality, we then obtain for all e > 

sup ||£ A (/)-P°[£ A (/)]|| >e|P|] 

\/6^\^„,A6A / 

<2 |- ? l f nh,( £ -g-W \ , , 

" ' ' XP l 98 7l L x /C2 iax + 4 7max /C max ( £ -i?'-^6, l )y'- { - } 
Now, let us look at $\inf_{f \in \mathcal F \setminus \mathcal F_{\delta_n},\, \lambda \in \Lambda} \|P^0[D_\lambda(f)]\|_{\ell_1}$ in (6.4). By the definition of $D_\lambda(\cdot)$, we have for 
any $f \in \mathcal F \setminus \mathcal F_{\delta_n}$ and any $\lambda \in \Lambda$ 



I^(/)]|U = E 

pG"P 



> 



> 



E 



X — Xf) 
ft 



ll*°-*lk 



^(.t)^^) / p'(a(x)z + f(x) - /(*)) G(z)dzdx 



X — Xq 



H(x)K h (x) / p'(<j{x)z + f(x) - f(xj) G(z)dzdx 



f°(x)-f(x ) 



n(x)K h (x) / p'(a(x)z + f(x)- f(x))G(z)dzdx 



where $t$ is such that $f = P_t$. Since $G(\cdot)$ is symmetric, $\rho'(\cdot)$ is increasing (because of the convexity of 
$\rho$), $K$ is nonnegative, and $\rho'$ is odd ($\rho$ is symmetric) and positive on $\mathbb R_+^*$ (because of $\rho'(0) = 0$, the 
convexity of $\rho$ and the strict convexity around 0), the last equality implies for all $f \in \mathcal F \setminus \mathcal F_{\delta_n}$ 



\P°[Dx(f)]\\ ei 



> 



> 



ll*°-*lk 

\f°(x)-f(x)\ 

\\t°-t\\ 



(j,(x)K h {x) I p'{<j{x)z + \f°(x) - f(x)\) G(z)dzdx 

f(x)-f(xf 



p(x)K h (x) / p' a(x)z + 8 n ± 



P°-*ll 



G(z)dz dx. 



Since |/°(x) — /(x)|||t° — t\\ g ^ < 1, we obtain with the mean value theorem for all / G J~\J~s„ 



\P°[Dx(f)]L > 



|/°(x)~/(x)| 
\\t°-t\\l 



-p(x)Kh(x) inf / p (a(x)z + u) G(z)dz dx 



- Sn mf / II, 119 

t:\\t\\ ei >8„ j wm^ 



p*(z)l 



-p(x)Kh(x) inf / p" (cr(x)z + u) <G(z)dzdx. 
ue[o,5„] ' 



We then derive, using that \/S^ < \ A or""" , and p" is P-continuous, 



inf ||P°LD A (/)]|L >^M^ inf / 
fer\r Sn 11 L AU 'J»<i " 2 t:\\t\\ H >5 n J \\t\\l 



-p{x)Kh(x) dx. 



We then observe that Pt (x) = iX T and thus 

-jj>(x)Kh(x)dx = t 



P*(z)| 



11*11?, 



11*11?, 



p(x)Kh{x)dx 



The matrix $\int X^\top X\, \mu(x) K_h(x)\,dx$ is positive definite (this follows from standard results, see, for 
instance, [30], Lemma 1.6). We can thus write 



11*11?, 



p{x)Kh(x)dx 



* T >*VIH 



where $\nu$ is the smallest eigenvalue of the matrix $\int X^\top X\, \mu(x) K_h(x)\,dx$. In summary, we have 



inf \\P°[D x (f)]\\ > 



2|P| 
With $\delta_n = 2|\mathcal P|^2 ((\ln n)^{-1} + \vartheta_h b_h) / (A\rho''_{\min}\nu)$, Inequalities (6.4) and (6.7), and the last inequality, 
we obtain 



»/.{3AG A : fx^TsJ < 

< 2|P|exp 



/« b*P \\Dx(f)-P [Dx(f)]\\ ei >^^ 
\feF\Fs n ,\eA l \i J \ 



^((lnn)" 1 - E'f 



987max^iax + 47 ma x£ max ((ln n)" 1 - £'), 

Invoking Condition 1 on n in Section 5.1 and the definition of E', the desired claim follows. 



Proof of Lemma 8. For any $x, y \ge 0$, we have 

$$x^q = |x - y + y|^q$$
$$= |[x-y]_+ + y|^q\, 1\{x > y\} + |y - [y-x]_+|^q\, 1\{x \le y\}$$
$$\le (2^q [x-y]_+^q + 2^q y^q)\, 1\{x > y\} + y^q\, 1\{x \le y\}$$
$$\le 2^q [x-y]_+^q + 2^q y^q.$$

For the second part, we set $x := (x_1, \dots, x_l)^\top$ and use Hölder's Inequality to derive 

$$\|x\|_{\ell_1} \le l^{1-1/q} \|x\|_{\ell_q},$$

from which the proof follows. 
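
As a quick numerical sanity check of the two elementary inequalities of Lemma 8, read here as $x^q \le 2^q[x-y]_+^q + 2^q y^q$ and $(\sum_i x_i)^q \le l^{q-1}\sum_i x_i^q$ for integers $q \ge 1$, the sketch below returns the gap between right-hand and left-hand sides. The function names are illustrative assumptions, and a nonnegative gap on a grid is of course no substitute for the proof.

```python
def positive_part(u: float) -> float:
    """[u]_+ = max(u, 0)."""
    return max(u, 0.0)


def first_inequality_gap(x: float, y: float, q: int) -> float:
    """Right-hand side minus left-hand side of
    x^q <= 2^q [x - y]_+^q + 2^q y^q, for x, y >= 0."""
    return 2 ** q * positive_part(x - y) ** q + 2 ** q * y ** q - x ** q


def second_inequality_gap(xs: list, q: int) -> float:
    """Right-hand side minus left-hand side of
    (sum_i x_i)^q <= l^(q-1) * sum_i x_i^q, for x_i >= 0 and l = len(xs)."""
    l = len(xs)
    return l ** (q - 1) * sum(v ** q for v in xs) - sum(xs) ** q
```

Equality in the second inequality is attained when all entries are equal, for example `second_inequality_gap([1.0, 1.0, 1.0], 2)` evaluates to zero.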



Proof of Lemma 9. We first recall by the definition of the estimator (2.13) 



V(A) = 



ILhPn 



A'(/a)1 +A' 00 (nn h )" 1 /4 



Pn\"(fx) 



where 



and 



n h p„ 



a '(A)] 2 = EJtt[^-A(^))] 2 ^ 2/ 

i— 1 



PrX(h) = £ -^- P "(Y t f x (X t ))K (X*-J«L 



nll h 

i — j. 

Then, using Massart's Inequality (given in the proof of Lemma 6) with n(Xi) = — — — - — -j= [p'(Yi~ 
/ W ))]^(^),a = l, b=^, 

jj3/2 

ff(-)=^„uA, w (-) ! z = \nn, and S n (n) = 2 h (p w [A' (f)f - P [X' (f)f) , 



we can control the deviations of the process $\pi$ as follows: 



V sup IL h Pn[X'(f)]-P[X'{f)Y 



^ 7max^max^ln(n) \ ^ n / 2 in 

> rw= - 2 n ' ( 6 - 

Vn-II/j / 
where $\gamma_{\max}$, $\mathcal K_{\max}$ and $B_z$ are defined in Section 2. Similarly, using Massart's Inequality with 
*W = wdwnr^'W - f™ K * = L ^ = TOT' 

H(-)=H rSn vAA-), * = ani e (n), and S n W = ^ (P n X" (f) - PX" (f)) , 

7max'^max 

we control the deviations of $\pi$ as follows: 

P,. [ sup |P„A"(/) - PA"(/)| > 7max/C ^ ein( " ) ^| < 2/n 2 . (6.9) 

Then, by the continuity of p' and p" almost everywhere, p' < 7 max , < 1, 7max > 1 (see 

Definition 1), /, /* € and the mean value theorem, we have for all / £ Ts n 



1 " / y _ 

P [A'(/)] 2 - P [A'(/* )] 2 | < -i- |p'(K, - /(X 4 )) 2 - pf(Yi f*{Xi)?\K* (-^y^ 



Similarly, 



»=i 

<2 7 {Sn + b h ). 



sup |PA"(/) - PA"(/*)| < Lp.fC^Sn + b h ) < V7ma X /C max (<5„ + b h ). 



Denote by s n := (2 V ip")7max^max(^n + bh) + 7max/C """' Bl " ( " ) , Moreover, we observe (under Con- 
dition 2 on n) that 



and thus 



s n < a n max ^Ap'^ in J K h (x)p(x)dx, U h inf P [A' (/*)]' 



s n < a„max inf P\"(f),IL h inf P [A' (/*)]' 

1 A6A A6A 



Using this, (6.8), and (6.9), we obtain for $\lambda \in \Lambda$ with probability $1 - 2/n^2 - \mathbb P_{f^*}(\Omega^c)$ 



and 



PX"[f*)-s n l-a n 

This proves the claim. 

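Proof of Lemma 10. We recall that 

$$\mathcal D_h(t) = \int K_h(x) \int \rho'(\sigma z + f^*(x) - t)\, G(z)\,dz\,dx,$$

and thus, with the mean value theorem, there exists a $c \in [t, \tilde t]$ such that 

$$\mathcal D_h(\tilde t) - \mathcal D_h(t) = (t - \tilde t) \int K_h(x) \int \rho''(\sigma z + f^*(x) - c)\, G(z)\,dz\,dx.$$

As $t, \tilde t \in [f^*(x_0) - \delta_n, f^*(x_0) + \delta_n]$ and $f^* \in \mathbb H_d(\beta, L, M)$, we have for any $x \in V_h$ 

$$|f^*(x) - c| \le |f^*(x) - f^*(x_0)| + |f^*(x_0) - c| \le b_{h_{\max}} + \delta_n. \quad (6.10)$$

Using $c_\rho = \int \rho''(\sigma z) G(z)\,dz$ and the previous two inequalities, we obtain 

$$|\mathcal D_h(t) - \mathcal D_h(\tilde t)| = |t - \tilde t|\, \Big| \int K_h(x) \int \rho''(\sigma z + f^*(x) - c)\, G(z)\,dz\,dx \Big|$$

$$= |t - \tilde t|\, \Big| \int K_h(x) \int \rho''(\sigma z + f^*(x) - c)\, G(z)\,dz\,dx - \int K_h(x) \int \rho''(\sigma z)\, G(z)\,dz\,dx + c_\rho \Big|.$$

As $\rho''$ is $L_{\rho''}$-Lipschitz, we obtain with (6.10) and Condition 1 in Section 5.1 

$$\Big| \int K_h(x) \int \rho''(\sigma z + f^*(x) - c)\, G(z)\,dz\,dx - \int K_h(x) \int \rho''(\sigma z)\, G(z)\,dz\,dx \Big| \le L_{\rho''} \int K_h(x) |f^*(x) - c|\,dx \le L_{\rho''} (b_{h_{\max}} + \delta_n).$$

We then deduce from the last two displays that 

$$|\mathcal D_h(t) - \mathcal D_h(\tilde t)| \ge c_\rho (1 - \sqrt{b_{h_{\max}} + \delta_n})\, |t - \tilde t|$$

and with Condition 1 finally 

$$|t - \tilde t| \le (1 + 2\sqrt{b_{h_{\max}} + \delta_n})\, c_\rho^{-1} |\mathcal D_h(t) - \mathcal D_h(\tilde t)|.$$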
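
The last step of the proof of Lemma 10 passes from a lower bound with factor $(1 - s)$ to an upper bound with factor $(1 + 2s)$, which rests on the elementary fact that $1/(1-s) \le 1 + 2s$ whenever $0 \le s \le 1/2$ (indeed $(1+2s)(1-s) = 1 + s(1-2s) \ge 1$). The sketch below checks this numerically; reading $s$ as $\sqrt{b_{h_{\max}} + \delta_n}$ bounded by $1/2$ is our interpretation of Condition 1, and the function name is an illustrative assumption.

```python
def inverse_bound_gap(s: float) -> float:
    """Return (1 + 2s) - 1/(1 - s).
    The gap is nonnegative exactly when 0 <= s <= 1/2."""
    if not 0.0 <= s < 1.0:
        raise ValueError("s must lie in [0, 1)")
    return (1.0 + 2.0 * s) - 1.0 / (1.0 - s)
```

The bound is tight at both endpoints $s = 0$ and $s = 1/2$ and fails for larger $s$, which indicates why a restriction of the type $s \le 1/2$ is needed.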



Proof of Lemma 11. Let us first set for any h G Ti 



Th 



Thus, with Lemma 8, we obtain 



E/.c- 9 sup 



any/y{p*,K*)(Bo + ani 6 (n)) 



Vh(f)-T>h(f) 



<2m f .[c} sup T> h (f)-V h (f) 



7~h 



Next, note that c^ 1 



X>h(-) - ©/»(■) < 27 max /C max (Ap^ in )- 1 =: T. Consequently, 
E f Jcf sup V h (f)-V h {f) 



u 9-1 ?/. ( c^ 1 sup 



77, 



T>h(f)-T> h (f) 



(6.11) 



(6.12) 
Similarly as in (5.14), we derive on the event {A,Vh £ H e } 

where \* h (x,y,f) := p* (y - f(x)) K* h (x) and X h (x,y,f) := p(y - /(&)) K h (x) for all x,y e M. 



Setting u 



/nUh 



e in (6.12), using the last inequality, and Lemma 6 with z > such that 



B a + ani e (n) + e = B z , we get 

EfJcf sup f> h {f)-V h {f) 



n 



< qrl I s«- l t f , 

\ /G^„ 



< qrl I e 9_1 P/* I sup sup 



v h (f)-v h (f) 

V h (f)-V h (f) 



? V(\ h )(B + ani e (n) + e) 
£>o + ani e (n) + e 



o 



<2qrl J\"- I exp\ 



(e + ani £ (n)) 2 



> 



100 + 4(e + ani e (n))/(nn^) 1 /4 



ds 



de 



Using $\mathrm{ani}_\epsilon(n)/(n\Pi_h)^{1/4} \le 1$, $(2\gamma_{\max}\mathcal K_{\max}(\rho''_{\min} A)^{-1})/(n\Pi_h)^{1/4} \le 1$ (Condition 3 on $n$ in Section 
5.1), and (5.10) with $a = 104$ and $b = 4$, we get 



< 2(?t^ exp 



A(f)-T> h (f) 



(ani e (n))' 



108 



77, 



e 9 1 exp 



104 + 4e 



de 



< 



4^(11. 



(11.4) 9 Gamma( 9 ). 



n|W e | h 

From (6.11) and the last inequality, the lemma can be deduced. 



Proof of Lemma 12. Recall that we consider the uniform design and the homoscedastic noise 
level. By the definition of $\mathcal D_h$ and with a change of variables, we have 

sup |2W(/)-X> h (/)| 



sup 



K(x) J p'(oz + f*(x + /W h'x) - f(x ))G{z)dz dx 
K{x) I p 1 (az + /* (xq + hx) — f(xo))Gr(z)dz dx , 
where $G(\cdot) = \frac{1}{n} \sum_{i=1}^{n} g_i(\cdot)$. Using $f^* \in \mathbb H_d(\beta, L, M)$, the $L_{\rho''}$-Lipschitz continuity of $\rho''$, the last equality, and 
the mean value theorem, we obtain: 

sup \V h , vh (f)-V h (f)\ 

f^S n 

< sup p"(az + s)G(z)dz K(x)\f*(x + hyh'x)-f*(x + hx)\dx 

|s|<25„+26 hmax J J 



7 = 1 



With Condition 1, this yields 

$$\sup_{f \in \mathcal F_{\delta_n}} |\mathcal D_{h' \vee h}(f) - \mathcal D_h(f)| \le (1 + \sqrt{\delta_n + b_{h_{\max}}})\, c_\rho L \sum_{j=1}^{d} (h'_j)^{\beta_j}.$$